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SECTION  1 


INTRODUCTION  AND  SUMMARY 

1 . 1  INTRODUCTION 

This  report  Is  concerned  with  the  development  and  application 
of  improved  techniques  of  digital  signal  processing,  based  on  the 
use  of  residue  number  systems  (RNS),  to  Implement  the  processing 
functions  associated  with  isolated-word  speech  recognition.  It 
constitutes  final  documentation,  for  fiscal  year  1984,  on  MITRE 
Mission  Oriented  Investigation  and  Experimentation  (MOIE)  project 
7440:  Advanced  Architectures  for  Signal  Processors. 

Speech  recognition  is  a  computationally  intensive  application 
for  digital  signal  processing  in  which  residue  number  system  tech¬ 
niques  can  play  an  effective  role  in  reducing  the  computational 
burden,  or  equivalently,  in  Increasing  the  throughput  rate  at  a 
fixed  computational  level.  Under  this  project,  we  have  explored  the 
use  of  RNS-based  computations,  in  combination  with  systolic 
architectures,  for  the  improved  Implementation  of  speech  recognition 
algorithms.  Our  work  has  focused  on  a  particular  type  of  word 
recognition  algorithm  that  is  based  on  an  autoregressive  model  of 
the  speech  production  process  and  a  dynamic  programming  approach  to 
effecting  time  registration  between  test  and  stored-reference  speech 
patterns.  While  other  approaches  to  speech  recognition  are  certain¬ 
ly  feasible  and  the  subject  of  active  research  at  many  institutions, 
the  approach  we  have  adopted  is  representative  of  the  results  of 
many  years  of  speech  research  occurring  in  a  large  segment  of  the 
speech  processing  community.  Our  objective  was  not  to  advance  the 
state  of  speech  research,  but  rather  to  concentrate  on  the  RNS 
implementation  of  a  well-understood  and  commonly  used  method  of 
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speech  recognition.  We  expect  that  the  RNS  implementation  advan¬ 
tages  demonstrated  by  our  work  will  extend  also  to  the  processing 
functions  resulting  from  Improved  speech  recognition  research. 

1.2  SCOPE 

In  the  remainder  of  section  1  we  will  present  a  view  of  the 
fundamentals  of  speech  recognition  processing  sufficient  for  under¬ 
standing  of  our  RNS  implementation  work.  In  section  2  we  will 
present  a  self-contained  derivation  of  the  Itakura-Salto  distortion 
function  from  a  time-domain  viewpoint,  the  distortion  computation 
being  a  central  issue  in  any  speech  recognition  process.  Section  3 
is  devoted  to  a  complete  description  of  the  dynamic  tlme-warping 
(DTW)  algorithm  used  for  time  registration  between  test  and  refer¬ 
ence  patterns  and  its  RNS  implementation.  Of  particular  concern  Is 
a  technique  of  quantization  of  the  distortion  values  within  RNS, 
necessary  to  contain  the  dynamic  range  while  allowing  satisfactory 
discrimination  between  word  matches  and  mismatches.  Section  4 
discusses  RNS  implementation,  in  a  linear  systolic  array,  of  the 
sample  autocorrelation  function  estimate  upon  which  the  distortion 
computations  are  based.  Also  developed  is  the  RNS  architecture  for 
a  two-dimensional  systolic  array,  or  computational  wavefront  pro¬ 
cessor,  in  which  the  distortion  function  and  dynamic  programming 
computations  are  carried  out.  Of  particular  concern  is  the  pipe¬ 
lining  of  the  computations  to  maintain  a  high  throughput  in  the 
processor.  Section  5  summarizes  conclusions  and  presents  recommen¬ 
dations  for  further  work.  Necessary  details  of  residue  number 
systems  and  their  properties  are  contained  in  an  appendix. 


1.3  SPEECH  RECOGNITION  KUNDAMENTALS 


The  rudiments  of  a  speech  recognition  process  are  pictured  In 
figure  l.l.  In  this  process  the  Input  utterance,  which  Is  a  word  to 
be  matched  to  one  In  a  reference  library  of  stored  utterances,  is 
analyzed  In  short  blocks  of  overlapping  segments  from  which  a  spec¬ 
trogram,  or  time  versus  frequency  plot,  of  the  utterance  may  be 
constructed.  Segmentation  Into  short  blocks  allows  the  process  to 
be  viewed  as  locally  stationary,  the  time  variation  being  accommoda¬ 
ted  by  the  sequential  processing  of  overlapping  analysis  segments. 

It  is  assumed  that  a  similar  analysis  has  been  performed  on  the 
utterances  contained  In  the  reference  library.  In  both  cases, 
feature  vectors  are  compared  to  produce  a  local  measure  of  distor¬ 
tion  between  segments  of  the  test  utterance,  and  those  of  one  of  the 
reference  utterances. 

if  there  are  n  segments  of  the  test  utterance  and  m  segments  of 
the  reference  utterance  then  the  local  distortions,  based  on  a 
Euclidean  distance  metric  or  something  similar,  define  a  two-dimen¬ 
sional  grid  of  n  X  ra  distortion  values,  the  low  values  corresponding 
to  good  matches  between  analysis  segments  and  the  high  values  corre¬ 
sponding  to  poor  matches.  The  purpose  of  the  dynamic  tlrae-warplng 
algorithm  Is  to  effect  time  registration  between  the  stored  and  test 
segments  to  compensate  for  local  time  expansion  or  contraction  of 
the  test  utterance  with  respect  to  the  reference.  It  is  a  dynamic 
programming  algorithm  which  calculates  the  accumulated  weighted 
distortions  for  the  least-cost  path  through  the  grid  of  distortion 
values.  This  score  for  the  comparison  of  utterances  is  compared  In 
magnitude  with  the  scores  for  other  pairings  to  produce  a  final 
decision  as  to  which  reference  utterance  provides  the  best  match. 
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The  dynamic  time-warping  computations  and  distortion  function 
computations  impose  the  greatest  computational  burden  in  the  recog¬ 
nition  process.  Although  the  short-time  spectral  analysis,  or 
feature  extraction,  computations  can  be  quite  complex,  the  analysis 
needs  only  to  be  performed  once  for  each  segment  of  the  test  utter¬ 
ance.  Although  the  distortion  and  DTO  computations  may  not  be  as 
complex,  individually,  there  is  a  need  to  produce  an  array  of  dis¬ 
tortion  computations  for  each  pair  of  utterances  coupled  with  the 
dynamic  programming  computations  to  produce  a  score  comparing  the 
test  utterance  with  each  of  the  reference  utterances  stored  in  the 
library.  This  is  the  computational  bottleneck  in  the  recognition 
process  that  we  expect  to  Impact  with  the  combination  of  RNS  compu¬ 
tation  and  systolic  array  architecture. 

1.3.1  Preprocessing  of  the  Speech  Waveform 

Since  the  speech  recognition  process  will  involve  digital 
computations  on  overlapping  segments  of  the  analog  speech  waveform, 
preprocessing  of  the  speech  signals  is  required  to  obtain  the  appro¬ 
priate  digital  signals.  The  waveforms  must  be  appropriately  sampled 
and  quantized,  the  beginning  and  end  of  an  utterance  established, 
the  utterance  segmented  into  overlapping  blocks,  and  the  segments 
windowed  for  subsequent  spectral  processing.  The  sampling  rate  must 
be  high  enough  to  prevent  aliasing,  the  quantization  must  be  suffi¬ 
cient  for  satisfactory  digital  representation  of  the  analog  signal, 
and  the  segment  length  must  be  short  enough  to  provide  a  stationary 
sample  of  the  spectrum  yet  long  enough  to  provide  adequate  spectral 
resolution.  The  reference  patterns  to  be  stored  in  the  library  must 
be  preprocessed  in  Identical  fashion  to  avoid  artificial  distortion 
due  to  processing  differences.  In  our  experimental  work,  we  have 
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made  use  of  a  computer  simulation,  operating  on  the  MITRE  Corporate 
Research  Computer  Facility,  which  digitally  processes  speech  wave¬ 
forms  that  have  been  preprocessed  In  our  Audio  Signal  Conversion 
Laboratory  (ASCL). 

Input  processing  of  the  speech  waveforms  and  their  A/D  conver¬ 
sion  are  pictured  In  figure  1.2.  Voice  signals  are  picked  up  by  the 
microphone,  amplified,  and  passed  through  low-pass  filters  to  remove 
frequency  components  above  4  kHz.  After  equalization  to  compensate 
for  a  finite  sampling  aperture,  the  analog  signals  are  converted  to 
12-blt  digital  samples  at  an  8  kHz  sample  rate  by  an  A/D  converter. 
Output  from  the  A/D  converter  Is  either  stored  In  a  designated  file 
for  future  Input  to  the  simulation  or.  In  recognition  mode,  may  be 
Input  directly  to  the  speech  recognition  system  Implemented  In  the 
simulation. 

1.3.2  Utterance  Detection 

Detection  of  an  utterance,  as  contrasted  with  a  period  of 
silence,  Is  regarded  as  a  digital  preprocessing  function  In  our 
work.  In  our  experimentation,  we  have  based  utterance  detection  on 
observation  of  energy  statistics.  The  procedure  Is  pictured  In 
figure  1.3.  The  energy  statistic  Is  a  measure  of  the  short-time 
average  signal  energy  minus  the  long-time  (exponentially  averaged) 
signal  energy.  Three  thresholds  are  set  as  shown  In  the  figure. 

The  beginning  of  an  utterance  Is  detected  If  the  energy  statistic 
rises  above  the  START  threshold  and  remains  above  It  until  crossing 
the  HIGH  threshold.  The  end  of  an  utterance  Is  detected  If  the 
energy  statistic  falls  below  the  END  threshold  and  remains  below  It 
for  at  least  150  msec.  These  events  constitute  a  valid  utterance 
detection  If  the  length  from  beginning  to  end  Is  at  least  240  msec. 


'  K"**  ■  •  ' 


IS 


h  V*  -• 
s  ••  V 

^  if 


|1 


r.'- . 

i*  s’  S 


*.  •c. 

.Xv! 


'  . 
A 


ML . m 


6 


ENERGY 


TPl<U*lPnr* 


1.3.3  Segmentation  of  the  Utterance 


After  detection,  an  utterance  must  be  divided  Into  segments  for 
short-time  analysis.  The  segmentation  that  we  use  In  our  experimen¬ 
tation  Is  shown  schematically  in  figure  1.4.  The  analysis  Interval 
should  be  long  enough  for  good  spectral  resolution,  yet  short  enough 
to  capture  as  stationary  significant  features  of  the  utterance. 

Extensive  observation  of  speech  waveforms  has  revealed  that  the 
duration  of  stationary  speech  events  varies  over  a  wide  range.  Very 
short  events  (such  as  those  corresponding  to  the  burst  associated 
with  a  plosive  consonant)  with  a  duration  of  only  a  few  milliseconds 
and  very  long  events  (such  as  those  corresponding  to  the  production 
of  a  vowel)  with  a  duration  exceeding  350  milliseconds  may  be 
observed.  Most  stationary  speech  events  have  a  duration  in  the 
range  of  12  milliseconds  to  174  milliseconds;  the  distribution  is 
skewed  so  that  the  median  duration  (50  to  75  milliseconds)  Is  always 
shorter  than  the  mean  duration  (60  to  85  milliseconds).  Most 
systems  that  analyze  speech  do  so  on  a  fixed  time  scale  (about  20  to 
25  milliseconds)  that  Is  considerably  shorter  than  the  median  dura¬ 
tion  of  stationary  speech  events  and  without  regard  for  the  location 
of  the  event  relative  to  the  analysis  Interval.  Some  systems  employ 
overlapped  analysis  intervals  (with  an  advance  of  about  10  milli¬ 
seconds)  so  that  the  deleterious  effects  of  employing  a  fixed  time 
scale  and  ignoring  event  location  are  reduced.  We  have  chosen  to 
segment  the  utterances  Into  22.5  msec,  blocks  with  an  advance  of  10 
msec.  Thus,  each  segment  consists  of  180  samples  (at  .^n  8  kHz 
sample  rate)  with  each  segment  advanced  by  80  samples.  A  typical 
utterance  may  be  blocked  into  as  many  as  60  segments.  Each  segment 
Is  windowed  by  a  Hamming  window  function  for  purposes  of  spectral 
smoothing. 
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1.4  SPECTRAL  PROCESSING  FOR  FEATURE  EXTRACTION 


Generally,  speech  analysis  is  based  on  identification  of  spec¬ 
tral  features  that  form  a  time-varying  pattern,  or  spectrogram,  to 
distinguish  utterances.  Short-time  spectral  analysis  is  used  as  a 
means  of  dealing  with  speech  segments  over  time  intervals  in  which 
the  spectra  are  stationary,  the  concatenation  of  these  spectral 
segments  forming  the  spectrogram.  The  spectral  approach  to  speech 
analysis  is  justified  by  the  results  of  many  years  of  experimenta¬ 
tion  and  empirical  observations  and  in  the  ability  to  accurately 
model  the  vocal  tract  and  its  excitation  with  acoustical  transmis¬ 
sion-line  models.  Experimental  evidence  abounds  to  show  that  much 
of  the  Information  contained,  or  perceived,  in  a  speech  signal  is 
coded  by  the  collusion  of  a  few  formants,  or  natural  resonant 
frequencies,  of  the  vocal  tract. 

Three  methods  of  short-time  spectral  analysis  are  prevalent  in 
current  speech  recognition  research:  windowed  discrete  Fourier 
analysis,  processing  in  a  band  of  contiguous  frequency-selective 
filters  and  spectral  analysis  based  on  linear  predictive  coding 
(LPC). 

Windowed  discrete  Fourier  analysis  is  accomplished  by  computing 
a  set  of  time-overlapping  discrete  Fourier  transforms  of  finite 
length,  the  number  of  points  being  determined  largely  by  the  compu¬ 
tational  resources  available.  The  filter-bank  approach  to  spectral 
analysis  requires  the  signal  being  analyzed  to  be  processed  by  a  set 
of  bandpass  filters.  The  highest  spectral  resolution  is  normally 
provided  at  the  low-frequency  end  of  the  spectrum  and  lower  resolu¬ 
tion,  or  larger  bandwidth,  is  provided  at  the  upper  frequencies. 
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Linear  predictive  oodinp,  (LPC)  is  based  on  a  model  of  the 
utterances  produced  by  the  vocal  tract  as  a  linear  system  driven  by 
white  noise  or  a  pitch-synchronous,  almost  periodic  pulse  train. 
Based  on  Wiener's  (19A9)  work  on  linear  prediction  of  stochastic 
time-series,  a  number  of  recursive  algorithms  are  available  to 
determine  the  parameters  of  the  time-varying  linear  system,  from 
which  It  is  a  simple  matter  to  evaluate  the  system  function  In  the 
frequency  domain  to  generate  the  spectrogram.  Alternatively,  the 
LPC  parameters  can  be  used  directly  as  a  means  of  encoding  speech 
for  pattern  discrimination  and  recognition. 

The  LPC  method  Is  equivalent  to  autoregressive  spectral  estima¬ 
tion,  which  In  turn  Is  equivalent  to  maximum  entropy  spectral  esti¬ 
mation  for  Gaussian  processes.  We  have  chosen  to  use  this  method  in 
our  RNS  development  for  several  reasons.  One  is  the  successful  use 
over  many  years  of  linear  prediction  theory  for  speech,  which  is 
largely  due  to  the  natural  fit  of  an  all-pole  model  to  the  speech 
signal  during  voicing,  attributed  to  the  absence  of  zeros  in  the 
transfer  function  of  the  transmission-line  model.  Another  consider¬ 
ation  Is  the  maximum  entropy  viewpoint  which  does  not  constrain 
artificially  the  data  where  It  Is  not  observed  but  rather  produces  a 
Spectrum  that  presumes  maximum  uncertainty  for  the  unobserved  data. 
Finally,  the  computations  Involved  in  processing  the  data  Involve 
autocorrelation  estimates  rather  than  spectral  filtering  or  trans¬ 
forms,  and  these  are  well  suited  to  RNS  computation,  particularly 
with  RNS  systolic  architectures  that  have  been  developed  for  trans¬ 
versal  filtering  under  MlTRfi's  Integrated  Klectronlcs  project. 
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Autoregressive  Spectral  Estimation 


Autoregressive  spectral  estimation  of  speech  Is  treated 
thoroughly  In  the  literature,  but  Its  essentials  will  be  reviewed 
here  for  completeness  [1].  The  autoregressive  (AR)  or  linear  pre¬ 
dictive  coding  (LPC)  model  of  speech  assumes  that  speech  may  be 
modeled  as  the  output  of  a  linear  system  of  finite  order  having  only 
poles  in  its  frequency  domain  transfer  function  and  driven  either  by 
Gaussian  white  noise,  or  by  a  pitch-synchronous  periodic  signal, 
depending  on  whether  the  sound  is  unvoiced  (as  for  certain 
consonants)  or  voiced  (as  for  vowels).  The  parameters  of  the  model 
are  estimated,  from  which  It  Is  a  simple  matter  to  generate  the 
power  density  spectrum. 

The  most  common  description  of  the  AR  model  Is  In  terms  of  the 
model  gain,  a  >  0,  and  a  set  of  predictor  coefficients  {a^; 
n  =  1,  2,  ...,  P}  which  are  selected  so  that  a  monlc  P*^*^  order 
polynomial,  z*’Ap(z),  defined  by 


Ap(z)  =  y  a^z  Uq  =  I 

n=0 


has  all  its  roots  inside  the  unit  circle  z  =  With  these 
parameters  so  defined,  the  order  autoregressive,  or  AR(P), 
model  spectrum  Is  given  by 


He)  = - - 

Ap(e^  )Ap(e  H 


(l.l) 


1- V  'C*.  '.-  ' 


(1.2) 


The  requirement  that  z^Ap(z)  have  all  Its  roots  inside  the  unit 
circle  is  made  so  that  the  predictor  coefficient  sequence  is 
unique.  Note  that  for  any  root  of  z^Ap(z)  which  is  inside  the 
unit  circle,  there  is  a  corresponding  root  of  z“^Ap(z~^)  which 
is  outside  the  unit  circle.  By  swapping  corresponding  roots  between 
this  pair  of  polynomials,  one  obtains  different  sets  of  polynomial 
coefficients  without  affecting  <|/(0).  Among  these  2^  polynomials, 
the  one  with  all  its  roots  inside  the  unit  circle  is  called  a  mini¬ 
mum  phase  polynomial. 

In  addition  to  providing  a  unique  parametric  description  of  the 
AR(P)  model  spectrum,  the  minimum  phase  condition  is  Important 

in  that  it  is  equivalent  to  stability  for  the  linear  shift  Invariant 
filter  with  transfer  function 

-fCz)  =  (1.3) 

Ap(z) 

Ap(z)  is  often  described  recursively  in  terms  of  a  sequence 
of  reflection  coefficients.  Thus 

An(2)  =  \i-l(z)  +  KnZ"nAn_i(z'l);  A^Cz)  =  1  (1.4) 

for  n  =  1,  2,  ...,  P.  If  |Kj,|  <  1  and  An_i(z)  is  minimum  phase, 
then  Aj^(z)  is  also  minimum  phase.  Thus,  Ap(z)  is  minimum  phase 
if  and  only  if  every  reflection  coefficient  in  the  set  {Kj,;  n  =  1, 

2,  ...,  P}  is  less  than  one  in  absolute  value.  One  virtue  of 
reflection  coefficients  is  that  they  admit  this  simple  test  for  the 
■ninlmum  phase  condition. 


By  expanding  i|;(0)  in  a  Fourier  series,  one  obtains  the  follow¬ 


ing  pair  of  relationships  between  the  AR(P)  model  and  its  autocor¬ 
relation  coefficient  sequence. 


\1)(0)  =  ^  r 

n 


(1.5a) 


rn  =  /  (|/(0)ein0  d0 

2it 


(1.5b) 


Because  (I* (9)  Is  symmetric,  rj^  =  r-^.  By  equating  the  right-hand 
side  of  equations  1.2  and  1.5a  and  then  multiplying  both  sides  of 
the  resulting  equation  by  Ap(e^®)  one  obtains 


Ap(ei®)  I 


Ap(e-i9) 


(1.6) 


Then,  by  expanding  both  sides  of  this  equation  and  equating  coeffi¬ 
cients  of  like  powers  of  e^9 ,  one  may  derive  the  Yule-Walker,  or 
normal,  equations  expressed  in  matrix  form  as 
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Given  the  tcimonced  sequence  of  autocorrelation  coefficients  {r^,; 
n  =  0,  1,  P>  the  above  symmetric  Toeplltz  coefficient  matrix  is 

known,  and  one  may  solve  equation  (1.7)  to  determine  the  predictor 
coefficients.  *^or  this  parameter  set,  the  minimum  phase  condition 
is  equivalent  to  the  condition  that  all  principal  minors  of  the 
coefficient  matrix  have  a  positive  determinant. 

To  organize  the  distortion  function  computation,  to  be  dlcussed 
below,  it  is  useful  to  define  a  sequence  of  Inverse  correlation 
coefficients  {u^,;  a  =  0,  I,  ...,  P}  as 
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(1.8) 


These  are  related  to  the  parameters  of  the  AR  model  by  tlie  equation. 


P-n 

^  aj^ay+n/a^  (1.9) 

i=0 
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The  Itakura-Salto  Distortion  Function 


The  Itakura-Salto  distortion  function,  upon  which  our  work  Is 
based,  was  Introduced  In  1970  as  an  analysis  technique  for  making  a 
maximum  likelihood  estimate  of  the  power  spectral  density  of  an 
autoregressive  process  modeled  as  the  output  of  an  all-pole  filter 
driven  by  white  Gaussian  noise  [2].  Their  original  work  has  been 
subsequently  extended  by  a  number  of  researchers,  and  variations  of 
their  original  Idea  have  resulted  In  a  number  of  related  distortion 
measures  [3].  Below,  we  present  the  basic  formulas  that  result  when 
the  Itakura-Salto  distortion  Is  Interpreted  as  a  measure  of  spectral 
matching.  In  section  2,  we  will  present  a  different  and  self- 
contained  derivation  that  Is  developed  In  the  time  domain. 

Defining  f(0)  as  the  power  spectral  density  of  a  test  segment 
and  g(9)  as  the  power  spectral  density  of  a  reference  utterance,  the 
Itakura-Salto  distortion  may  be  expressed  as 


dis(f.g)  = 


Hn— — ^ 


(1.10) 


When  the  spectra  of  the  test  and  reference  segments  are  the  same, 
the  distortion  Is  zero  for  an  all-pole  model  spectrum  and  It  Is 
possible  to  show  by  means  of  contour  Integration  of  equation  (1.2) 
In  the  complex  plane  that 


where  xr  is  the  variance  of  the  white  Gaussian  process  driving  the 
all-pole  filter  whose  transfer  function  is  Ap(2).  The  distortion 
may  then  be  written  as 


TT 

djg  =  f  -  Inal  +  £na|  -  1  (1.12) 
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In  terms  of  the  laverse  correlation  coefficients  of  equation  (1.8), 
the  remaining  integral  may  be  rewritten  as 
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(1.13) 


The  Integral  on  the  right-hand  side  of  equation  (1.13)  Is  simply  the 
sequence  of  autocorrelation  coefficients  {r^;  n  =  0,  ±1,  ...,  ±«>} 
whose  discrete  Fourier  transform  Is  the  power  density  spectrum  f(9), 
as  In  equation  (1.5).  Then  we  have 


TT  P 

^  f(9)  d9  ^  y 
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(1.14) 


which  Is  seen  to  be  a  scalar  product  between  the  vector  of  2P  +  1 
autocorrelation  coefficients  £  with  the  vector  of  Inverse  corre¬ 
lation  coefficients  u.  It  is  a  fortunate  consequence  of  the 
autoregressive  model  that  only  2P  +  1  autocorrelation  coefficients 
of  the  test  process  are  needed  in  the  distortion  computation  (and 
they  are  symmetric).  When  the  process  actually  is  P^'^  order 
autoregressive,  these  are  sufficient  to  predict  the  remaining 
coefficients  and  hence  the  power  density  spectrum.  The  distortion 
function  may  finally  be  expressed  as 


dis(f,g) 


I  un(g)>^n(f) 

n=-P 


!ln  -  1 

4 


(1.15) 


This  is  the  form  of  the  Itakura-Saito  distortion  used  for  computa¬ 
tion.  In  section  2  we  will  discuss  determination  of  the  variances 

2^2 
Of  ana  Og. 

In  computing  the  distortion  of  equation  (1.15),  notice  that 

u,  the  vector  of  Inverse  correlation  coefficients  and  the  corre- 

2 

spending  variance  Og  for  each  reference  segment  may  be  precomputed 

from  the  normal  equations  (1.7)  and  equation  (1.9)  and  stored  in  the 

reference  library.  For  the  segments  of  a  test  utterance,  it  is 

necessary  to  determine  £,  the  vector  of  2P  +  1  autocorrelation 
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coefficients  and  the  variance  of  since  these  cannot  be  computed  in 
advance . 


1 . 5  DYNAMIC  TIME-WARPING 


Although  voice  spectrograms  provide  distinct  patterns  for 
discrimination  and  recognition,  variations  in  speaking  rate,  speaker 
inflections,  and  variations  from  speaker  to  speaker  produce  local 
time  variations  in  the  patterns  that  must  be  compensated  for  effec¬ 
tive  comparison  with  the  stored  patterns.  Dynamic  programming, 
introduced  for  speech  recognition  by  Sakoe  and  Chiba  [4],  has  been 
successfully  employed  to  bring  about  adequate  time  registration,  but 
the  computational  complexity  is  high  because  of  the  need  to  register 
the  segments  of  each  test  utterance  against  those  of  every  reference 
utterance,  including  variations  of  the  same  word,  stored  in  the 
reference  library. 

The  dynamic  time-warping  algorithm  and  its  RNS  implementation 
will  be  discussed  in  entirety  in  section  3.  A  systolic  architecture 
for  the  RNS  implementation  will  be  discussed  in  section  4. 

1.6  SUMMARY  OF  RESULTS 


The  diagram  of  a  speech  recognition  system  based  on  linear 
predictive  coding  (or  autoregressive  spectral  estimation)  and 
dynamic  tlrae-warplng  is  shown  in  figure  1.5.  This  diagram 
corresponds  to  a  computer  simulation  model  used  extensively  in  our 
RNS  implementation  studies.  (The  reader  is  referred  to  Appendix  A 
for  a  discussion  of  residue  number  systems  and  their  properties.) 


As  shown  In  figure  1.5,  the  speech  waveform  after  preprocessing 
Is  converted  to  12-blt  digital  samples  and  segmented  Into  over¬ 
lapping  blocks  of  180  samples  each  which  are  windowed  with  a  Hamming 
window  function.  Thirteen  autocorrelation  values  are  estimated  for 

each  segment.  The  LPC  parameters  are  extracted  by  solution  of  the 
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normal  equations  (1.7)  with  the  process  gain  Og  and  the  Inverse 
correlation  coefficients  Uj^  stored  In  the  reference  library.  The 
LPC  computations  are  performed  off-line  and  computed  with  floating 
point  arithmetic,  but  the  scaled  and  quantized  parameters  may  be 
stored  In  RNS  code. 

For  the  test  samples,  the  same  autocorrelation  Is  performed 
with  the  RNS  values  entered  Into  the  dynamic  time  warping  processor 
to  be  compared  with  the  reference  library  segments.  The  Itakura- 
Salto  distortions  are  computed  largely  in  RNS,  as  discussed  in 
section  3,  to  establish  the  DTW  grid.  The  array  of  60  x  60  points 
Is  a  typical  value;  actual  utterances  In  our  library  range  between 
28  and  72  segments.  The  dynamic  time  warping  path  metric  calcula¬ 
tions  are  also  carried  out  In  RNS,  as  will  be  discussed  In 
section  3. 

F.xtenslve  studies  were  made  with  this  simulation  model  to 
support  our  RNS  implementation  development  and  selection  of  RNS 
parameters.  Our  studies  have  shown  that  RNS  can  be  quite  useful  In 
Implementing  a  dynamic  tlme-warplng  based  speech  recognition 
algorithm,  although  some  significant  problems  still  remain. 


Our  architectural  studies  demonstrate  that  RNS  computation  Is 
quite  natural  for  computing  the  correlation  estimates  needed  in  the 
LPC  analysis  and  in  the  Itakura-Salto  distortion  computation.  In 
fact,  a  linear  systolic  architecture  can  make  use  of  hardware  tech¬ 
niques  already  developed  for  RNS  Implementation  of  a  transversal 
equalizer  under  MITRE's  Integrated  Electronics  project.  The  pipe¬ 
lined  architecture  will  be  discussed  In  section  4. 

We  have  found  that  the  dynamic  tlme-warplng  calculations  can  be 
profitably  carried  out  In  RNS  If  the  Itakura-Salto  distortion  values 
are  first  quantized  Into  the  range  of  a  smaller  RNS.  In  fact, 
binary  quantization  (match  or  no-match)  seems  adequate  If  the 
unquantized  distortion  values  provide  sufficient  discrimination 
between  true  and  false  word  matches.  We  developed  an  algorithm  for 
performing  the  quantization  within  RNS,  which  is  presented  In 
section  3,  with  the  hardware  Implementation  discussed  in  section  4. 

Subject  to  these  conditions,  RNS  seems  useful  both  in  computing 
the  distortion  values  and  In  accumulating  the  least-cost  path  metric 
in  the  dynamic  tlme-warplng  algorithm.  In  section  4,  It  Is  shown 
how  both  these  computations  may  be  carried  out  In  a  pipelined  two- 
dimensional  systolic  array,  with  the  data  flow  specified  and  Impli¬ 
cations  for  RNS  Implementation  of  the  hardware  discussed. 


Although  the  path  computations  for  DTW  can  be  easily  carried 
out  In  a  practlcal-slzed  RNS  with  appropriate  quantization  of  the 
distortion  function  values,  the  RNS  range  used  to  compute  the 
Itakura-Salto  distortion  Is  still  rather  large  (30  bits  equiva¬ 
lent).  The  quantization  algorithm  also  seems  more  complicated  than 
necessary.  Improvements  to  alleviate  these  conditions,  and  there¬ 
fore  to  develop  a  more  practical  method  of  RNS  Implementation,  are 
the  subject  of  continuing  work,  as  will  be  discussed  In  our  conclud 


SECTION  2 


USE  OF  THE  ITAKURA-SAITO  DISTORTION  FUNCTION 
FOR  SPEECH  RECOGNITION 


2 . 1  INTRODUCTION 


The  Itakura-Saito  distortion  was  derived  originally  for  maximum 
likelihood  estimation  of  the  parameters  of  an  all-pole  filter  used 
for  speech  synthesis  [2].  In  their  work,  it  was  used  in  an  experi¬ 
mental  system  for  speech  analysis  and  resynthesls,  and  it  demonstra¬ 
ted  good  performance  by  comparison  of  sound  spectrograms  of  the 
input  and  resynthesized  waveforms.  They  Interpreted  their  distor¬ 
tion  function  as  having  physical  meaning  for  comparing  power  density 
spectra  of  short-time  speech  records,  a  view  which  has  appealed  to 
the  speech  processing  community  and  which  has  been  expanded  in  the 
more  recent  literature  [3]. 

Unquestionably,  spectral  models  are  useful  and  have  a  long  his¬ 
tory  of  use  in  the  analysis  of  speech.  They  provide  convenient  men¬ 
tal  pictures  that  can  be  related  to  acoustic  models  of  the  vocal 
tract  and  its  excitation.  Traditionally,  speech  patterns  are  repre¬ 
sented  by  sound  spectrograms  presented  as  two-dimensional  Intensity 
plots  of  time-varying  power  spectra.  Trained  speech  researchers 
learn  to  "read"  such  plots,  demonstrating  the  ability  to  recognize 
patterns  not  readily  apparent  in  the  temporal  waveform.  It  is  not 
surprising,  therefore,  that  speech  recognition  work  is  so  heavily 
influenced  by  the  spectral  viewpoint. 


On  the  other  hand,  the  Itakura-Saito  distortion  function  Is 
well-suited  to  formulation  In  the  time  domain,  such  a  formulation 
being  relatively  simple  and  free  from  such  sophisticated  mathema¬ 
tical  notions  as  the  asymptotic  distribution  of  eigenvalues  associa¬ 
ted  with  Toeplltz  forms,  results  of  which  are  needed  In  a  rigorous 
spectral  approach  [5].  Itakura  and  Salto  modeled  speech  as  an  auto¬ 
regressive  random  process  produced  at  the  output  of  an  all-pole 
linear  filter  driven  by  white  Gaussian  noise  [6].  They  showed  that 
this  model,  with  Its  parameters  obtained  as  a  maxlmum-llkellhood 
estimate,  minimized  their  distortion  function.  In  an  efficient 
distortion  computation  based  on  a  linear  predictive  coding  model, 
the  frequency  spectra  are  not  explicitly  used,  and  In  fact,  an 
appropriate  distortion  function  can  be  derived  entirely  In  the  time 
domain.  In  an  application  such  as  word  recognition,  a  frequency- 
independent  approach  may  have  merit  In  Illuminating  the  essential 
computational  aspects.  The  temporal  approach  also  leads  naturally 
to  a  matched-filtering  Interpretation  which  suggests  an  alternative 
computational  Implementation. 

2.2  A  FKRQUKNCY-INDEPENDENT  FORMULATION 

We  begin  by  expressing  the  distortion  function  as  an  average 
log-llkellhood  ratio,  which  measures  the  dlstlngulshablllty  between 
two  random  processes  f  and  g  [7]. 

P(xlf ) 

diS  =  lf)£n  dx.  (2.1) 
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The  symbols  f  and  g  represent  zero-mean  Gaussian  random  proces¬ 
ses  corresponding  to  test  and  reference  utterances,  respectively. 

In  application  to  speech  recognition,  since  the  samples  jc  are  drawn 
from  the  process  f,  the  conditional  probability  density  functions 
are  related  by  the  Inequality  pC^clg)  <  p(xlf),  therefore  d^g  >  0. 

Samples  of  the  Gaussian  process  f  can  always  be  regarded  as 
linear  combinations  of  samples  of  a  white  Gaussian  noise  process.  In 
other  words,  as  the  output  of  a  stable  discrete  linear  system  driven 
by  samples  of  white  Gaussian  noise.  The  linear  system  may  be  des¬ 
cribed  In  general  by  the  linear  difference  equation, 

L  K 

x(n)  +  y  aj^x(n  -  ?.)  =  Of  5k.u(n  -  k)  (2.2) 

11  =  1  k=0 


where  L  <  K,  Of  Is  the  variance  of  the  Input  white  noise  process, 

Bg  =  1  and  x(n)  represents  the  sequence  of  output  samples.  For  the 
process  g,  we  assume  that  the  corresponding  linear  system  Is  des¬ 
cribed  by  the  restricted  difference  equation 

M 

s(n)  +  y  ajnS(n  -  m)  =  0gp(n). 


(2.3) 


In  (2.3),  Og  is  the  variance  of  the  Input  process  and  s(n) 
represents  the  sequence  of  output  samples.  Notice  that,  unlike 
x(n),  s(n)  depends  only  on  the  present  and  not  on  past  input  sam¬ 
ples,  and  is  referred  to  as  an  autoregressive  random  process  (the 
parameters  a^,  determine  an  all-pole  filter  In  accordance  with  the 

frequency-domain  transfer  function  corresponding  to  (2.3)).  While 
2 

Og  is  the  variance  of  the  input  process,  it  will  be  convenient  to 
regard  Og  as  the  gain  of  the  linear  system  driven  by  g(n), 

(n  =  0,  1,  2,  ...),  a  sequence  of  samples  of  unit-variance  white 
Gaussian  noise. 

We  assume  that  the  distortion  is  to  be  computed  for  a  sequence 
of  N  samples  x(n)  of  the  test  process  f,  and  that  the  comparison  is 
to  be  made  with  the  model  of  the  reference  process  g  that  is  des¬ 
cribed  by  the  linear  system  of  (2.3).  We  denote  by  x  the  column 
vector  of  N  samples  [x(0),  x(l),  ...,  x(N  -  1)]*-  and  by  ^  the 
N  X  N  matrix  [tninJ  (ensemble-average)  covariance  values  of  x. 
Similarly,  we  denote  by  ^  the  covariance  matrix  of  N  samples  of 
the  process  g  and  we  assume  that  M  <  N.  With  this  notation,  we  may 
express  the  conditional  Gaussian  probability  density  functions  as 


where  |R|  signifies  determinant (R) .  After  substitution  of  (2.4)  and 
(2.5)  into  (1),  the  distortion  may  be  expressed  as. 


d 


1  t„-l  ^  1  t_-l 
■^■xRfX  +-2X^21 


(2.6) 


where  the  overbar  signifies  the  ensemble  average  over  the  multi¬ 
variate  distribution  of  x  conditioned  on  f. 

The  second  term  of  (2.6)  may  be  expanded  via 


_ f 


N-1  Nrl  _ 

^  1  I  '  mn 

|Rf I  m=0  n=0 


(2.7) 


where  l^lmn  cofactor  of  the  element  r^j^  in  the  deter¬ 
minant  of  the  covariance  matrix  We  recognize  the  inner  sum  as 

Laplace's  expansion  of  1^1,  consequently. 


To  compute  the  ensemble  average  remaining  In  (2.9),  we  Invoke 
properties  of  the  linear  system  of  (2.3).  There  exists  a  linear 
system  that  Is  Inverse  to  that  of  (2.3)  In  the  sense  that  It  Is  a 
"whitening"  filter  for  s(n).  In  other  words.  If  the  two  systems 

are  cascaded  and  driven  by  white  Gaussian  noise,  then  the  output  Is 
also  white  Gaussian  noise.  The  unit  Impulse  response  of  the  cas¬ 
caded  system  must  be  a  unit  Impulse.  If  one  writes  out  a  few  terms 
of  the  recursion  of  (2.3)  for  x(0)  =  0,  and  ii(0)  =  1,  ii(n  >  0)  =  0, 
then  It  Is  almost  trivial  to  verify  that  the  Inverse  filter  Is  a 
finite  Impulse  response  filter  with  Its  Impulse  response  given  by 
the  sequence 

[h(0),  h(l),  ...,  h(M)]  =  [1,  aj ,  (2.10) 

If  we  let  ^  represent  an  N-vector  [v(0),  v(l) . 

v(N  -  l)]t  of  output  samples  of  the  Inverse  filter,  subject  to 
(unit-variance)  white  Gaussian  noise  Input  =*  tij(O),  ii(l),  ...» 
y(N  -  1)],  then  the  output  covariance  matrix,  ^  »  vv^,  may  be 
written  as 


^  =  HIHt  =  HHt 


(2.11) 


where  H  Is  an  N  x  N  Toeplitz  matrix  composed  of  impulse  response 
elements  h(k), 


Since  the  Inverse  filter  is  a  whitening  filter  for  £  =  [s(0), 
s(l),  s(N  -  1)]^,  we  also  have  £  =  Ite.  Consequently, 

U  ut  =  H  s  st  Rt  =  H  ^  Ht.  (2.13) 


Then,  with  reference  to  (2.11), 


(2.U) 


which  shows  the  covariance  matrix  of  the  reference  speech  model 
samples  to  be  the  Inverse  of  the  covariance  matrix  of  the  output  of 
the  inverse  filter  (when  driven  by  white  Gaussian  noise).  Now,  we 
may  write 


where  Is  to  be  detecralaed  by  (2.12)  from  the  model 

parameters  of  the  reference  process  g. 

^  Is  an  N  X  N  symmetric  matrix  with  Its  elements  determined 
as 


m<M 

^’^mn  “  hi^h|m_n|+jc 


(2.16) 


for  |m  -  n|  ^  M.  For  Ira  -  n|  >  M,  =  0.  In  terms  of  the 
linear  system  parameters,  for  |ra  -  n|  <  M, 


ra<M 

'^ran  “  ~  \  ^k^k+lm-n]  “  ~  ^mn  (2*17) 

k=0 


■t. 


.  - .  1 

.VV."! 


where,  by  definition,  ag  =  1.  It  is  convenient  to  regard  ^  as 
the  sura  of  two  matrices. 


c*.  *.  s 


system  is  eventually  stationary,  after  the  initial  transient 
settles,  but  it  is  strictly  non-statlonary  even  though  the  input  is 
stationary  [9]. 

If  we  denote  by  £  the  vector  of  2M  +  1  elements  £  =  [rj^, 
rM_i,  ...,  rj  ,  rg ,  rj  ,  ...,  rvj]  and  similarly  define  A  =  [A^, 

M-l»  »  Ag ,  A^ ,  ...,  A^] ,  then  the  bilinear  form  (2.15)  can  be 
written  in  terms  of  a  scalar  product 


_ _ f  „  „  ,  Mrl  M-2k 

x^WpX  =  (A*  r)  +  _1  )  y 


(A.£)  +  _±  )  )  (m  -  k  -  Orka^_iaji+k-l- 

a|  a2  k=0  £=1  (2.19) 


Now  we  have, 


diS  =  ^g.n  j-^  j  -  N  N-M  (A«r)  + 
2  "1^  2  2a2  -- 


(2.20) 


M-1  M-2k 

T  y  (m-k-e)rkaj^_ia£+k-i. 
2al  k=0  e=l 


For  M  «  N,  a  good  approximation  is  simply, 

djs  -  ^g.n  j  -  ^  (A  •  r). 

2  2  2a2  -  - 
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(2.21) 
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We  will  be  concerned  next  with  computation  of  the  determinants 
1^1  and  1^1*  Direct  computation  of  the  N  x  N  determinants,  by 
using  Laplace's  expansion  for  example,  would  impose  an  enormous 
computational  burden  since  N  is  typically  greater  than  a  hundred 
samples.  But,  since  |^|  =  the  parameters  of  the 

all-pole  filter  model  can  save  considerable  work.  Since  ^  = 

HHt ^  the  product  of  lower-  and  upper-triangular  matrices,  and 
det(HHt)  =  det(H)det(Ht)  =  ho^  it  follows  immediately  from 
(2.10)  that 


inlSol  =  N  £n  oi 


(2.22) 


As  noted  earlier,  Og  is  the  variance  of  the  white  noise 
process  driving  the  linear  system  of  (2.3).  It  can  be  determined 
readily  from  the  normal  equations  which  must  be  solved  to  determine 
the  system  parameters  [1]. 


Sq  Si 


^M-1  ^M-2 


(2.23) 
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be  the  normal  equations  defining  the  parameters  of  the  test  process 
(assumed  autoregressive).  Application  of  Cramer's  rule  to  (2.25) 


2 


(2.26) 


|R<">I 


|«f-»l 


where  irJ">i  Is  the  determinant  of  the  (M+1)  x  (M+1)  covariance 
matrix  and  is  the  cofactor  of  the  first  element.  In  this 

case,  rather  than  computing  an  N  x  N  determinant,  the  autoregressive 
assumption  for  the  test  process  allows  the  computation  of  the  ratio 
of  determinants  of  substantially  smaller  order,  since  generally 
M  «  N. 

Finally,  we  may  express  the  distortion  as  (approximately,  for 
M  «  N) 


1  In  d  -  N  +  JL  (A 

2  2  2a| 


r)* 


(2.27) 


If  we  define  the  normalized  covariance  coefficients  for  the 

2 

test  process  as  pi^  =  then  the  approximate  distortion 

function  (2.27)  is  expressed  as 


drs  -  i  Ind  -  N  +  (A  .  p). 

2  4  2  2,2  -  - 


(2.28) 
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If  the  processes  are  identical,  except  for  the  Input  variances,  then 
(2.28)  becomes 


!i  -  N  fi 

a?  2  ^ 


(2.29) 


which  would  vanish  for  equal  Input  variances. 

2.2.1  Gain  Normalization 


For  purposes  of  word  recognition,  we  would  like  the  distortion 
function  to  be  Independent  of  relative  differences  between  the  model 
Input  variances  of  the  test  and  reference  utterances.  We  would  like 
to  recognize  words  produced  by  the  same  LPC  model  regardless  of 
Intensity  differences.  For  development  of  such  a  distortion  func¬ 
tion,  It  is  appropriate  to  use  a  log-likelihood  ratio  based  on  nor¬ 
malized  Gaussian  probability  density  functions. 

Let 


^  =  x/of 


Rf/a2 


(2.30) 


and  consequently. 


If  y  Is  a  sample  vector  of  the  process  g,  then  it  has  a  probability 
density  function  given  by 


(2.33 


and  if  we  let 


a  =  y/og 


ig 


(2.34 


then  the  normalized  probability  density  function  is 


The  gain-normalized  distortion  function  can  be  derived  from  the 
normalized  average  log-llkellhood  ratio 


«iis 


p(6lf)ln  d6 

P(ilg)  “ 


(2.36) 


where  B  Is  the  gain-normalized  test  utterance  vector  described  In 
(2.30).  With  reference  to  the  previous  formulation,  we  observe  that 
I Sg I  =  1^1  “  1  as  a  result  of  the  gain  normalization.  Then 


,  _ , _ t  .  — „ - 1 

‘llS  =  -  7  2 


(2.37) 


Since  ^  Is  the  covariance  matrix  for  the  first  term  is  simply 
-N/2  and 


iIS 


(2.38) 


or,  In  terms  of  the  observed  test  vector  x 


From  (2.27),  we  may  write  the  gatn-normaltzed  distortion  function  as 


(A 


r) 


(2.40) 


or,  In  terms  of  the  normalized  test  correlation  coefficients  pj^. 


IS  -  ~  2  1 


(2.41) 


This  Is  the  most  convenient  form  for  computation  of  the  gain- 
normalized  Ttakura-Salto  distortion,  expressed  as  the  scalar  product 
of  two  normalized  vectors,  the  vector  ^  composed  of  the  correlation 
values  of  the  parameters  of  the  inverse  filter  model  (the  so-called 
inverse  correlation  coef f icients)  and  those  of  p  composed  of  values 
of  the  N  normalized  covariance  samples  of  the  test  process. 

Although  ensemble  averages  have  been  used  in  the  formulations  above, 
stationarlty  allows  for  computation  with  time  averages,  the  time 
averages  asymptotically  approaching  the  ensemble  averages  for  a 
sufficiently  large  number  of  samples,  equivalent  to  a  long  time 
average. 


2.3  A  MATCHED  FILTERING  [NTERPRETATION 


Let  H  represent  the  matrix  of  Impulse  response  coefficients  of 
the  Inverse  filter  corresponding  to  the  reference  process  g,  as 
given  by  (2.10)  and  (2.12).  Let  x,  the  sample  vector  drawn  from  the 
test  process  f,  be  the  Input  to  the  transposed  system  and  let 
z  be  the  vector  of  output  samples.  Then  the  Inputs  and  outputs  are 
related  by  the  matrix  equation 

z  =  H^x.  (2.42) 


The  sum  of  the  squares  of  the  output  samples  may  be  expressed  as  an 
^nner  product 


z^z  =  (H*-x) ^(H*^x)  =  x^H  H^x. 


(2.43) 


From  (2.11)  and  (2.14), 

z^£  =  x^ Sg^x. 


(2.44) 


The  term  on  the  right  hand  side  of  (2.44)  may  be  expressed  as  the 
trace  of  a  matrix  product  [7], 


=  tr[^^xx*^)j. 


(2.45) 
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Taking  the  ensemble  average  of  both  sides  of  (2.44),  we  obtain 


(2.46) 


The  vector  z,  from  (2.42),  may  be  Interpreted  as  a  result  of 
convolving  the  test  sequence  x  with  the  Inverse  filter,  or  whitening 
filter  obtained  by  LPC  analysis,  but  with  the  samples  of  x  entering 
the  filter  in  reversed  order.  The  average  Is  taken  over  an  ensemble 
of  sample  functions  x  drawn  from  the  test  process  f.  If  f  Is  a 
wlde-sense  stationary  ergodlc  process,  then  the  averaging  may  be 
accomplished  by  squaring  and  adding  the  filter  output  samples  over  a 
time  Interval  that  Is  long  compared  with  the  time  Intervals  for 
which  the  samples  of  x  are  correlated,  the  effective  coherence 
time.  This  condition  should  be  satisfied  for  M  C  N. 

In  the  expression  for  the  gain-normalized  distortion  function 
(2.37),  we  may  write 

dis  =  -  ^tr[RflRfl  +  j  tr[S|lRf].  (2.47) 


Since 


and  the  trace  Is  a  distributive  function,  the  distortion  may  also  be 
expressed  as, 


d 


(2.49) 


The  distortion  as  expressed  in  (2.49)  may  be  interpreted  as  the  re¬ 
sult  of  passing  the  difference  sequence  (B  -  ot),  the  difference  be¬ 
tween  the  normalized  test  and  reference  sample  vectors,  through  the 
normalized  inverse  system  to  obtain  the  output  vector 

£=Ht(^-a)  (2.50) 


and  then  forming  the  ensemble-averaged  inner  product 


(i  ■  «)• 


(2.51) 


For  stationary  ergodlc  processes,  this  may  be  accomplished  by 
convolving  the  sequence  (£  “  2_)i  in  reversed  order,  with  the 
normalized  LPC  model  inverse  filter  having  the  Impulse  response 
values  {l,  a^ ,  a2 ,  aj^j}  and  then  squaring  and  adding  the 

output  samples. 

A  principal  difference  with  the  matched  filter  Implementation 
is  that  the  final  distortion  value  is  reached  monotonlcally  from 
below  rather  than  being  computed  as  a  convergent  series  with  sign 
alternation  as  in  a  direct  computation  of  (2.41).  This  could  be 
important  In  controlling  the  computational  range  of  the  processor  in 


an  integer  computation.  In  practice  it  may  be  satisfactory  to 
replace  the  squaring  detector  at  the  output  with  one  that  simply 
adds  the  magnitudes  of  the  samples,  thus  compressing  the  Integer 
range  of  the  output. 

2.4  COMPUTATION  IN  A  RESIDUE  NUMBER  SYSTEM 

The  key  equations  for  computation  of  the  distortion  are  equa¬ 
tion  (2.27)  for  the  Itakura  Salto,  and  equation  (2.41)  for  the  gain 
normalized  Itakura-Saito  distortion.  Equation  (2.51)  presented  an 
alternative  for  computation  of  the  gain-normalized  distortion  func¬ 
tion,  providing  an  output  that  increases  monotonlcally  to  the  final 
value. 

In  a  computation  using  the  real  numbers  in  a  conventional 
weighted  number  system  such  as  two's-complement,  the  use  of  equation 
(2.51)  rather  than  equation  (2.41)  could  be  Important  In  containing 
the  dynamic  range  of  the  processor.  In  a  computation  using  a  resi¬ 
due  number  system  It  makes  little  difference  since,  in  RNS,  Inter¬ 
mediate  products  may  overflow  the  range  available  provided  that  the 
eventual  output  is  contained  within  the  range  of  the  RNS.  (The 
reader  is  again  referred  to  Appendix  A  for  a  discussion  of  residue 
number  systems  and  their  properties.) 

If  the  Itakura-Saito  distortion  Is  used,  then  it  will  be  neces- 

2  2 

sary  to  compute  the  logarithms  of  the  squared  gains  of  and  Og. 

2  ® 

It  can  be  assumed  that  the  reference  gain  term  inog  has  been  pre¬ 
computed,  converted  to  RNS  representation  and  stored  in  the  refer¬ 
ence  library.  Similarly,  solution  of  the  normal  equations  to  obtain 


the  vector  A  of  inverse  correlation  coefficients  (or  the  vector 
2 

u  =  A/og  of  normalized  Inverse  correlation  coefficients)  can  be 

assumed  to  have  been  precomputed,  scaled  and  converted  to  RNS  form 

before  being  stored  in  the  reference  library.  Computation  of 
2 

-Unof,  which  is  needed  in  equation  (2.27),  can  be  obtained  from 
equation  (2.26)  as 

-InoJ  =  iln|R^””^^|-lln|(R^^’^|.  (2.52) 

The  determinants  in  equation  (2.52)  can  be  computed  in  RNS  if  the 
correlation  values  are  scaled  and  converted  to  RNS,  or  are  imme¬ 
diately  available  if  they  have  already  been  computed  in  RNS.  The 
logarithm,  however,  will  necessitate  conversion  to  a  weighted  number 
system  before  computation,  with  the  result  converted  back  to  RNS. 

An  excellent  algorithm  for  logarithmic  quantization  for  numbers 
represented  in  two's  complement  form  is  contained  in  [10].  It 
results  in  a  simple  hardware  implementation.  Conversion  to  and  from 
RNS  using  mixed-radix  representation  is  described  in  Appendix  A. 

If  the  gain-normalized  Itakura-Saito  distortion  is  used  as  in 

equation  (2.41),  then  the  normalized  correlation  coefficients 
2 

=  r/of  must  be  computed.  Although  r  may  have  been  computed  in 
RNS,  reconversion  to  a  weighted  number  system  to  facilitate  the 
division  is  to  be  expected,  after  which  the  values  can  be  scaled  and 
reconverted  to  RNS.  This  additional  conversion  process  should  not 
be  distressing  since  it  occurs  only  once  for  each  correlation 
vector;  whereas  the  distortion  function  must  be  computed  for  each 
point  in  a  constrained  grid  of  points  involved  in  the  dynamic  time 
warping  algorithm,  to  be  discussed  in  section  3. 


SECTION  3 


RESIDUE  NUMBER  SYSTEM  IMPLEMENTATION 
OF  A  DYNAMIC  TIME-WARPING  BASED  SPEECH  RECOGNITION  SYSTEM 

3.1  INTRODUCTION 

The  basic  operation  performed  by  a  speech  recognition  system 
(SRS)  is  the  matching  of  an  analyzed  test  pattern  representing  the 
unknown  word  to  be  identified  against  stored  reference  patterns 
which  represent  the  words  of  the  system  vocabulary.  A  problem  that 
arises  in  this  matching  is  the  need  for  time  registration  of  the 
different  speech  patterns.  The  pattern  representing  one  production 
of  a  word  will  differ  in  length  from  the  pattern  representing 
another  production  of  the  sane  word.  Furthermore,  individual  parts 
of  a  word  may  be  stretched  or  compressed  relative  to  the  same  parts 
of  another  production  of  the  same  word.  Attempts  to  perform  linear 
time  registration  of  speech  patterns  have  been  largely  unsuccess¬ 
ful.  However,  time  registration  by  dynamic  programming  [4,11]  has 
proven  to  be  an  effective  means  for  comparing  unknown  test  patterns 
against  stored  reference  patterns  of  speech. 

A  dynamic  time-warping  (DTW)  algorithm  finds  a  shortest  path 
through  a  grid  of  points.  Each  point  of  the  grid  represents  a 
matching  of  a  selected  pair  of  short-time  segments,  or  frames,  of 
the  unknown  test  pattern  and  a  given  reference  pattern.  Associated 
with  each  grid  point  is  a  value  which  is  the  calculated  local  dis¬ 
tortion  for  the  particular  match  of  test  and  reference  frames  rep¬ 
resented  by  the  point.  Associated  with  each  path  through  the  grid 
Is  a  distance  which  is  a  weighted  sum  of  the  local  distortions  for 
grid  points  lying  on  the  path.  The  output  of  the  DTW  algorithm  is  a 
score,  the  distance  of  the  shortest  path  through  the  grid,  represen¬ 
ting  the  degree  of  dissimilarity  between  the  matched  patterns. 


Computationally,  the  most  Intensive  portion  of  a  DTW-based  SRS 
Is  the  calculation  of  the  local  distortion  measures.  This  calcula¬ 
tion  must  be  performed  for  each  point  of  the  defined  DTW  grid,  which 
typically  may  contain  several  thousand  points,  and  must  be  repeated 
for  each  grid,  that  Is,  for  each  reference  pattern  contained  In  the 
library.  The  distortion  measures  employed  In  our  work  have  been 
variants  of  the  Itakura-Salto  distortion  measure  [2].  This  measure 
has  been  selected  because  It  has  a  strong  theoretical  justification. 
Is  known  to  have  performed  well  In  existing  recognition  systems,  and 
Is  relatively  easy  to  compute.  The  basic  computation  Involved  In 
evaluating  this  measure  Is  an  Inner  product  calculation  for  a  pair 
of  vectors  of  correlation  and  Inverse  correlation  coefficients 
representing  the  test  and  stored  reference  frames,  respectively. 
Inner  product  calculations  are  well-suited  for  Implementation  In  RNS 
If  the  end  result  does  not  need  to  be  Immediately  translated  back 
from  RNS  to  conventional  arithmetic  notation.  This  particular  cal¬ 
culation  of  the  Itakura-Salto  distortion  presents  both  a  challenge 
and  a  considerable  opportunity  for  RNS  Implementation.  The  chal¬ 
lenge  results  from  the  apparent  need  to  employ  a  large  range  In  the 
RNS  calculations  to  avoid  certain  problems  of  truncation  error 
resulting  from  the  Integer  conversion  of  the  Input  data  to  the 
calculation.  The  opportunity  results  from  the  established  fact  that 
the  range  need  not  contain  the  (much  larger)  Individual  products  or 
sums  accumulated  during  the  Inner  product  calculation,  nor  even  the 
scaled  autocorrelation  and  Inverse  autocorrelation  coefficients. 
Overflow  of  the  RNS  during  the  calculation  causes  no  harm  as  long  as 
the  final  result  lies  within  the  range  of  the  RNS,  and  even  occa¬ 
sional  overflow  by  the  result  may  not  be  harmful  to  the  DTW  distance 
computation. 


Three  functions  are  required  for  dynamic  tlme-warplng:  con¬ 
struction  of  the  DTW  grid,  the  set  of  points  (l(k),j(k))  on  which 
the  DTW  path  Is  permitted  to  He;  evaluation  of  a  local  distortion 
measure  for  all  points  of  the  grid;  and  solving  to  find  the  shortest 
path  through  the  DTW  grid  from  the  point  (1,1)  to  the  piylnt  (m,n), 
where  m  Is  the  number  of  reference  frames  and  n  Is  the  number  of 
test  frames  to  be  matched.  The  shortest  path  algorithm  Is  a  special 
simple  case  of  dynamic  programming  [12].  Construction  of  the  DTW 
grid  and  the  calculations  required  to  find  the  shortest  path  through 
the  grid  are  discussed  In  section  3.2.  Calculation  of  the  Itakura- 
Salto  distortion  measure  and  Its  variants  has  been  discussed 
previously  In  section  2. 

The  remainder  of  section  3  Is  concerned  with  the  Implementation 
of  the  DTW  algorithm  In  RNS.  The  approach  discussed  In  section  3.3 
utilizes  a  two-part  quantization  of  distortion  values,  first  Into 
the  range  of  a  single  modulus,  and  then  to  a  single  bit  (match  or 
no-match).  The  quantization  algorithm  Is  described  In  section  3.4. 
RNS  Implementation  of  the  shortest-path  calculation  Is  treated  In 
section  3.5.  Range  considerations  for  the  RNS  Implementation  of 
dynamic  tlme-warplng  are  discussed  In  section  3.6.  Finally,  results 
of  simulations  of  RNS  Implementations  of  a  DTW-based  speech  recogni¬ 
tion  system  are  presented  In  the  concluding  section  3.7. 

3.2  DYNAMIC  TIME -WARPING  ALGORITHM 

The  three  functions  contained  In  the  dynamic  tlme-warplng  algo¬ 
rithm  are  Illustrated  In  figure  3.1.  In  this  section  the  determina¬ 
tion  of  the  DTW  grid  point  set,  given  the  number  of  reference  frames 
m,  the  number  of  test  frames  n,  and  a  set  of  local  and  global  path 
constraints,  Is  described,  and  the  calculations  required  for  finding 


the  shortest  path  through  the  grid  are  derived  for  a  particular 
choice  of  local  constraints.  The  unknown  test  utterance  will  always 
be  assigned  to  the  y-axis  (vertical),  and  the  reference  utterance 
will  be  assigned  to  the  x-axls  (horizontal). 

3.2.1  DTW  Path  Constraints 

Initially,  before  application  of  any  constraints,  the  DTW  grid 
(figure  3.2)  consists  of  the  ra  x  n  points  (i,j),  1  <  1  <  m, 

1  £  j  £  n.  Each  point  (l,j)  represents  the  matching  of  the  1-th 
reference  frame  against  the  J-th  test  frame.  Certain  matches  and 
sequences  of  matches  (l.e.,  paths)  may  be  unreasonable  to  make, 
however,  and  should  be  ruled  out  In  advance.  Rules  are  adopted  In  a 
speech  recognition  system  to  avoid  such  unreasonable  paths  and 
pointless  computation.  It  is  the  role  of  the  local  and  global  path 
constraints  to  define  these  rules. 

Local  path  constraints  specify  in  a  precise  manner  the  ways  in 
which  a  particular  path  point  (l(k),j(k))  can  be  reached  from  a  pre¬ 
ceding  path  point  (i(k  -  1),  j(k  -  1)).  Following  Myers  et  al. 

[11],  we  represent  allowed  local  paths  by  a  set  of  productions  from 
a  regular  grammar.  A  production  Is  a  rule  of  the  form 

P:  (aj,bj)(a2,b2)...(aL,bL)  (3.1) 

where  L  Is  the  length  of  the  production,  and  the  (a,b)'s  are  seg¬ 
ments  in  a  sequence  of  local  moves.  All  a's  and  b's  and  L  are 
assumed  to  be  (small)  nonnegative  Integers.  Using  a  production,  a 
local  path  to  the  point  (i(k),j(k))  can  be  traced  backwards  to  the 
point  (l(k  -  1),  j(k  -  1))  through  L  -  I  Intermediate  points: 


k-th  point:  (i(k),j(k)) 


s  s 

s-th  intermediate  point:  (i(k)  -  y  an,  j(k)  -  S  b. ) 

l=l  il=l 


(k  -  l)st  point: 


L  L 

(l(k  -  l),j(k  -  1))  =  (i(k)  -  V  ajj.  j(k)  -  y  bji) 

£=1  £=1 


This  representation  of  local  path  constraints  provides  a  great 
deal  of  flexibility  in  their  choice.  The  left-hand  side  of  figure 
3,3  illustrates  the  Type  3  constraints  of  Myers  et  al.  [11],  which 
are  specified  by  the  four  productions: 


T>1:  (1,0)(1,1) 

P2:  (1,0)(1,2) 

P3:  (1,1) 

P4:  (1,2) 


These  four  productions  define  four  distinct  possible  local  paths  to 
a  given  point  (l(k),j(k))  in  the  DTW  grid,  coming  from  the  points 
(i(k)  -  2,  j(k)  -  1),  (i(k)  -  2,  j(k)  -  2),  (i(k)  -  1,  j(k)  -  1), 
and  (i(k)  -  1,  j(k)  -  2),  respectively.  The  first  two  of  these 
local  paths  also  pass  through  the  intermediate  point  (i(k)  -  1, 
j(k)).  Note  that  for  any  local  path  to  be  valid,  its  starting  point 
(l(k  -  1),  j(k  -  1))  and  its  end  point  (l(k),j(k))  must  belong  to 
the  valid  point  set. 


A  zero  value  for  an  a  (b)  in  a  production  implies  that  the 
corresponding  reference  (test)  frame  is  to  be  matched  with  more  than 
one  test  (reference)  frame.  A  value  greater  than  one,  on  the  other 
hand,  results  in  one  or  more  reference  (test)  frames  being  skipped 
(not  matched)  altogether.  Thus,  paths  PI  and  P2  of  the  Type  3  con¬ 
straints  allow  a  given  test  frame  to  be  matched  with  more  than  one 
reference  frame,  while  paths  P2  and  P4  permit  the  skipping  of  a  test 
frame.  Under  these  constraints,  each  reference  frame  is  used  exact¬ 
ly  once.  Corresponding  to  the  Type  3  constraints  is  a  reflected 
version,  the  Type  3a  constraints  shown  in  the  right-half  of  figure 
3.3.  These  are  specified  by  the  four  productions: 

PI:  (0,1)(1,1) 

P2:  (0,1)(2,1) 

P3:  (1,1) 

P4;  (2,1) 

For  these  constraints,  paths  PI  and  P2  match  a  given  reference  frame 
against  more  than  one  test  frame,  while  paths  P2  and  P4  permit  a 
reference  frame  to  be  skipped,  but  each  test  frame  is  used  once  and 
only  once. 

While  there  is  no  apparent  reason  for  claiming  that  one  set  of 
constraints  will  perform  better  than  the  other,  it  seems  more 
natural  to  require  that  each  test  frame  be  matched  exactly  once, 
while  allowing  reference  frames  to  be  skipped  or  used  more  than 
once.  Thus,  we  tend  to  prefer  the  Type  3a  constraints  over  the  Type 
3.  Myers  et  al.  in  effect  tested  both  types  (along  with  a  number  of 
other  sets  of  local  constraints)  by  using  the  Type  3  constraints  but 
allowing  the  assignment  of  test  and  reference  to  the  x-  and  y-axes 
to  be  reversed.  They  found  better  results  for  the  reversed  case. 


which  corresponds  Co  using  the  Type  3a  constraints.  Furthermore, 

there  are  computational  advantages  to  requiring  that  each  test  frame 

be  matched  exactly  once;  for  then,  the  logarithm  of  the  squared 
2 

gain.  Of,  of  the  test  power  spectrum  does  not  need  to  be  evaluated 
in  determining  the  local  distortion  values.  This  is  because  each 

2 

calculated  value  of  Of  will  be  used  exactly  once  in  any  legal  path 
from  (1,1)  to  (m,n),  and  therefore  can  have  no  influence  upon  deter¬ 
mining  the  best  path. 

Associated  with  each  local  path  to  a  grid  point  (i,j)  is  a  path 
cost  which  is  a  weighted  sum  of  the  local  distortion  values  for  grid 
points  passed  through  by  the  path.  One  of  the  simplest  of  weight 
functions  takes  the  form 

w(k)  =  i(k)  -  l(k  -  1)  (3.2) 

For  this  weight  function,  the  weight  assigned  to  a  local  path  is  the 
distance  traversed  In  the  reference  direction  (l.e.,  the  sura  of  the 
a's  in  the  production  defining  the  local  path).  It  is  customary  to 
divide  the  weight  equally  among  the  segments  forming  the  path. 

Thus,  for  type  3  local  constraints,  this  weight  function  assigns 
unit  weights  to  all  path  segments,  whereas  for  the  type  3a  con¬ 
straints  a  fractional  weight  will  result  for  the  segments  of  path 
PI. 

Local  constraints  limit  the  valid  point  set  making  up  the  DTW 
grid  in  the  following  manner.  For  each  procedure  P  of  a  local  con¬ 
straint,  let  sum(a)  denote  the  sum  of  all  the  a's  and  let  sum(b) 
denote  the  sum  of  all  the  b's.  The  slope  of  the  local  path  is  given 
by  the  ratio  sum(b)/sum(a) .  Let  emax  and  emln  denote  the  maximum 
and  minimum  slopes,  respectively,  obtained  over  all  productions  P 


comprising  the  local  constraint.  If  we  draw  lines  of  slope  emln  and 
eraax  through  the  endpoints  (1,1)  and  (m,n),  the  resulting  four  lines 
define  a  parallelogram  In  the  Initial  DTW  grid  within  which  all 
valid  points  must  lie  (see  figure  3. A).  Points  Intermediate  to 
local  paths  may  lie  outside  this  parallelogram,  but  the  endpoints  of 
such  paths  must  themselves  lie  on  or  within  the  parallelogram.  In 
figure  3.4  the  parallelogram  resulting  from  the  Type  3  constraints 
of  [11]  Is  shown,  drawn  In  solid  lines,  for  an  Illustrative  example 
representing  ten  reference  frames  and  eight  test  frames. 

Global  path  constraints  were  Introduced  by  Sakoe  and  Chiba  [4] 
to  further  delimit  the  legal  point  set.  These  constraints  take  the 
form 

li(k)  -  j(k)I  <  g  (3.3) 

for  some  nonnegative  Integer  g.  They  constrain  the  DTW  path  to  lie 
within  a  corridor  of  width  2g  centered  on  a  45-degree  diagonal 
through  the  point  (1,1).  Of  course.  If  |m  -  n|  >  g,  then  the  end¬ 
point  (m,n)  cannot  satisfy  the  global  constraint,  and  no  legal  DTW 
path  can  be  found.  Thus,  In  addition  to  restricting  where  the  path 
can  lie,  the  global  constraint  can  be  used  to  rule  out  altogether  a 
search  for  the  shortest  path  whenever  the  lengths  of  the  test  and 
reference  utterances  are  too  dissimilar. 


A  choice  of  g  =  0  will  permit  no  path  unless  n  =  m,  in  which 
case  all  local  paths  must  begin  and  end  on  the  diagonal  from  (1,1) 
to  (m,m).  The  global  constraint  usually  limits  the  DTW  grid  by  cut¬ 
ting  off  the  interior  corners  of  the  parallelogram  defined  by  the 
local  constraints.  In  the  example  illustrated  in  figure  3.4,  only 
the  lower  right  corner  is  in  fact  cut  off  by  the  severe  global  con¬ 
straint  g  =  2.  The  resulting  legal  points  comprising  the  DTW  grid 
are  shown  as  solid  grid  points.  The  hollow  or  empty  points  lying 
outside  the  parallelogram  are  Intermediate  points  which  may  be 
passed  through  in  traversing  certain  local  paths  which  begin  and  end 
in  the  legal  point  set.  The  selected  local  distortion  measure  must 
be  evaluated  for  such  intermediate  points  as  well  as  for  the  points 
in  the  legal  set. 

3.2.2  DTW  Path  Computations 

Dynamic  tlme-warplng  for  speech  recognition  was  first  formu¬ 
lated  as  a  problem  in  dynamic  programming  by  Sakoe  and  Chiba  [4]. 

In  fact,  however,  the  problem  of  finding  the  best  path  through  the 

DTW  grid  reduces  to  a  special  simple  case  of  dynamic  programming 

known  as  the  shortest  route  problem.  This  problem  can  be  stated 

briefly  as  follows:  Given  a  connected  graph  with  two  distinguished 
nodes  A  and  B  and  with  a  cost  associated  with  each  arc  from  a  node  i 
to  a  node  j  of  the  graph,  find  the  path  (i.e.,  sequence  of  arcs) 
from  A  to  B  whose  summed  cost  is  a  minimum.  Algorithms  for  finding 
an  optimal  solution  to  this  problem  were  first  given  (independently) 
by  Moore  [13]  and  Dantzig  [14].  Subsequently,  Bellman  [15]  formu¬ 
lated  the  shortest  route  problem  as  a  dynamic  programming  problem. 


The  network,  or  graph,  to  which  the  shortest  route  algorithm  Is 
applied  Is  defined  as  follows:  Nodes  of  the  graph  correspond  to 
legal  points  of  the  DTW  grid,  with  the  grid  point  (1,1)  as  the  node 
A  and  the  grid  point  (ra,n)  as  the  node  8.  The  arc  costs  are  defined 
as  weighted  sums  of  local  distortions  obtained  for  matches  of  refer¬ 
ence  and  test  frames  corresponding  to  grid  points  passed  through  In 
going  from  the  grid  point  associated  with  node  1  to  that  associated 
with  node  j.  For  the  type  3  local  constraints  and  the  weight  func¬ 
tion  defined  In  equation  (3.2),  the  costs  defined  for  arcs  of  the 
network  derived  from  the  DTW  grid  have  the  form 


c(Pl)  =  c(P2)  =  di_i  j  +  d^j 

(3.4) 

c(P3)  =  c(P4)  =  dij 


where  d^^j  Is  the  local  distortion  calculated  between  the  Ith 
reference  frame  and  the  Jth  test  frame.  The  network  derived  from 
the  DTW  grid  for  the  example  given  previously  In  figure  3.4  and 
assuming  weight  function  (3.2)  Is  shown  in  figure  3.5. 


The  minimum  cost  cj j  for  any  path  to  the  node  (l,j)  is  com¬ 
puted  (under  type  3  constraints  and  weight  function  (3.2))  as 


cij  =  Min  (dij  +  Ci_; d^j  +  ci_i^j_2, 


(3.5) 


^Ij  +  <^1-1, j  ^  d£j  +  f^i-l,j  ci_2,j-2'l 


where  the  first  two  terms  of  the  minimization  are  the  cost  of  reach¬ 
ing  (l,j)  by  local  paths  P3  and  P4,  and  the  latter  two  terms  are  the 
cost  of  reaching  (l,j)  by  local  paths  PI  and  P2.  Let 


‘^ij  =  ‘lij  +  Min  (ci_i^j_l, 


(3.6) 


ci-l,j  =  ‘^l-l.j  +  Min  (ci_2,j-i,  Ci_2,j-2)  (3.7) 


cij  *  Min  (cij,  dij  +  (3.8) 


Cmn»  the  minimum  cost  for  any  path  to  node  (m,n),  is  the  score 
returned  by  the  DTW  algorithm. 

The  shortest  path  computation  for  type  3  local  constraints  and 
weight  function  (3.2)  can  be  summarized  as  follows: 


1.  Compute  the  local  distortion  d^j  from  the  test 
frame  correlation  coefficients  rjj(j)  and  the 
reference  frame  inverse  correlation  coefficients 


k  ^  •  r 


w  •  ^  •  L"* 


2.  Compute  cj^j  =  djj  +  Min  ^i-l,j-2) 

3.  Compute  cj j  =  Min  fcjj,  j  + 


We  would  like  to  employ  RNS  for  step  1,  the  computationally 
most  intensive  calculation  In  a  DTW-based  speech  recognition  sys¬ 
tem.  The  problem  which  arises  if  RNS  Is  used  is  the  magnitude 
comparisons  required  for  steps  2  and  3. 

3.3  RNS  IMPLEMENTATION  OF  THE  DYNAMIC  TIME-WARPING  ALGORITHM 

In  order  to  make  use  of  RNS  for  the  local  distortion  calcula¬ 
tions  of  a  DTW  algorithm,  It  Is  highly  desirable  to  remain  within 
RNS  for  the  entire  DTW  shortest  path  computation,  leaving  only  to 
convert  the  final  score  output  by  the  algorithm  for  thresholding  and 
comparison  with  other  scores  to  select  the  best  match.  As  we  have 
seen  in  section  3.2,  solution  of  the  shortest  path  problem  involves 
a  sequence  of  additions  and  magnitude  comparisons.  In  general, 
magnitude  comparisons  cannot  be  efficiently  performed  within  RNS. 
However,  the  magnitudes  being  compared  in  the  shortest  path  computa¬ 
tion  may  be  similar.  If  their  difference  in  absolute  value  does  not 
exceed  half  the  largest  modulus  in  use,  then  relative  magnitude  can 
be  determined  without  leaving  RNS,  simply  by  testing  the  difference 
modulo  this  largest  modulus.  As  a  first  approach  to  an  RNS  imple¬ 
mentation  we  tested  the  following  hypothesis: 


SHORTEST  PATH  DISTANCE  HYPOTHESIS 

The  absolute  differences  of  cumulative  distances 
compared  in  steps  2  and  3  of  the  shortest  path 
algorithm  will  generally  not  exceed  half  the 
largest  modulus  we  are  willing  to  employ  in  a 
residue  number  system  of  practical  size. 


This  hypothesis  was  tested  by  carrying  out  simulations  using  our  DTW 
simulation  program.  Histograms  of  differences  arising  in  the  short¬ 
est  path  calculations  were  generated.  The  magnitudes  of  these 
differences  depend  upon  the  choice  of  local  distortion  measure.  We 
looked  at  what  these  differences  typically  are  in  a  conventional 
implementation  for  each  of  three  distortion  measures  provided  for  in 
the  simulation  program:  Itakura-Saito  distortion,  Gain-Optimized 
Itakura-Salto  distortion,  and  Gain-Normalized  Itakura-Saito 
distortion  [3].  Three  cases  were  examined: 

1)  identical  test  and  reference  templates; 

2)  similar  test  and  reference  templates;  and 

3)  different  test  and  reference  templates. 

The  second  case  arises  when  the  test  and  reference  utterances  are 
different  productions  of  the  same  word;  the  third  arises  when  the 
test  and  reference  utterances  are  productions  of  different  words. 

Results  are  summarized  in  Table  3.1,  which  gives  the  smallest 
and  largest  differences  encountered  in  each  of  the  three  cases  for 
each  of  the  three  distortion  functions,  together  with  the  number  of 
divisions  required  to  give  a  reasonable  portrayal  of  the  histogram. 
The  width  of  each  division,  except  the  first,  is  half  that  of  its 
successor.  The  first  division  has  the  same  width  as  its  successor 
in  order  that  all  differences  may  be  counted  with  a  reasonable  num¬ 
ber  of  divisions.  The  distributions  for  the  Itakura-Saito  metric 
are  plotted  in  figure  3.6  for  the  cases  of  similar  words  (clear)  and 
different  words  (cross-hatched). 


Table  3.1 

Differences  Arising  In  the  Shortest  Path  Calculations 
of  the  Dynamic  Tlme-Warping  Algorithm 


Distortion 

Measure 

Templates 

Smallest 

Largest 

Divisions 

Identical 

.0006 

374.9607 

13 

I-S 

similar 

.0031 

352.5326 

13 

different 

.0017 

733.7631 

14 

Identical 

.0004 

11.8593 

11 

G-0 

similar 

.0002 

7.3737 

10 

different 

.0007 

11.2536 

11 

Identical 

.0009 

37.7742 

13 

G-N 

similar 

.0011 

23.4734 

12 

different 

.0027 

151.2043 

14 

The  number  of  divisions  needed  to  cover  the  range  of  differ¬ 
ences  was  at  least  ten  for  all  distortion  measures  employed.  It  was 
concluded  that  our  first  shortest  path  distance  hypothesis  Is  not 
valid  for  distortions  based  on  the  Itakura-Saito  distance  measure. 
Therefore,  we  considered  quantization  of  the  distortion  function 
leading  to  a  modified  distance  hypothesis. 


SECOND  DISTANCE  HYPOTHESIS 

With  appropriate  quantization  of  local  distor¬ 
tion  values,  the  absolute  differences  of  cumula¬ 
tive  distances  arising  In  steps  2  and  3  will  not 
generally  exceed  half  the  largest  modulus  we  are 
willing  to  employ  In  a  residue  number  system  of 
practical  size. 
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For  testing  this  revised  distance  hypothesis,  an  extreme 
quantization  of  distortion  function  values  to  a  single  bit 
(0  =  match,  1  =  no-match)  was  employed.  A  breakpoint  threshold  was 
chosen;  all  distortion  values  exceeding  the  threshold  were  then  set 
equal  to  1;  all  less  than  or  equal  to  the  threshold  were  set  equal 
to  0.  To  select  the  breakpoint,  we  first  compiled  histograms  of 
distortion  function  values  for  matching  similar  test  and  reference 
templates  (different  productions  of  the  same  word)  and  different 
templates  (different  words).  These  are  shown  (for  the  Itakura-Saito 
metric)  in  figure  3.7,  where,  as  before,  the  histogram  for  different 
templates  is  cross-hatched.  We  seek  a  breakpoint  that  discriminates 
well  between  the  two  different  comparisons.  On  the  basis  of  this 
study,  a  breakpoint  threshold  of  .5  was  selected. 

Histograms  were  then  compiled  of  all  finite  differences,  in 
absolute  value,  arising  in  the  DTW  shortest  path  calculations,  by 
running  many  test  patterns  against  the  entire  reference  library. 
Under  the  extreme  quantization  employed,  most  of  these  differences 
became  zero  (table  3.2);  the  nonzero  differences  were  both  rela¬ 
tively  few  in  number  and  small  in  size.  This  result  supports  the 
second  hypothesis  that,  with  appropriate  quantization  of  local 
distortion  values,  it  is  feasible  to  carry  out  the  DTW  calculations 
in  the  shortest  route  algorithm  in  a  residue  number  system  of 
practical  size. 
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Table  3.2 

Histogram  of  Differences  in  DTW  Shortest  Path  Calculations: 
One-bit  Quantization  of  Gain-Normalized  Itakura-Salto  Distortion 
Function  Values  -  Breakpoint  =  .5 


Difference  Frequency  of  Occurrence 


Quantization  within  RNS  Is  difficult.  In  effect  calling  for 
sign  determinations.  Our  solution  has  been  to  employ  a  two-part 
quantization,  first  quantizing  from  the  range  of  the  RNS  into  the 
range  of  a  single  modulus,  and  then  requantizing  to  a  single  bit, 
using  a  threshold.  The  method  developed  for  quantizing  to  the  range 
of  a  single  modulus  will  be  described  In  section  3.4. 

The  revised  shortest  path  computation  Is  as  follows: 

1.  Compute  dj^j  from  r,^(j)  and  Uf^(l) 

2.  Quantize  to  single  bit  :  0  =  match, 

1  =  no-match 

3.  Compute  c^j  =  d'jj  +  Min(ci_i j_i,  Ci_i j_2) 

4.  Compute  c^j  =  Mln(c£j, 

Figure  3.8  is  a  block  diagram  of  an  RNS  implementation  of  a  DTW- 
based  speech  recognition  system.  After  detection  of  a  test  utter¬ 
ance,  Input  values,  possibly  scaled,  are  converted  to  RNS  for  calcu¬ 
lation  of  the  test  correlation  coefficients.  Inverse  correlation 
coefficients  from  the  reference  library  are  scaled  and  converted  to 
RNS  (this  would  normally  be  done  before  storing  them  In  the 
library),  and  the  distortion  function  Is  computed  as  an  inner  pro¬ 
duct  of  vectors  In  RNS.  The  output  of  this  calculation  is  quantized 
In  two  steps,  first  to  the  range  of  a  single  modulus,  and  then  to  a 
single  bit.  The  shortest  path  through  the  DTW  grid  Is  obtained  In 
an  RNS  of  reduced  size  as  described  In  section  3.5.  The  resulting 
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score  Is  reconverted  to  a  conventional  number  system  for 
thresholding  and  selection  of  the  winning  text.  In  section  3.6 
questions  of  RNS  range  and  scaling  are  discussed  briefly.  Simula¬ 
tion  results  for  an  RNS  Implementation  of  the  SRS  of  figure  3.8  are 
contained  in  section  3.7. 

3.4  QUANTIZATION  IN  A  RESIDUE  NUMBER  SYSTEM 

3.4.1  Introduction 


Our  simulation  experiments  support  the  hypothesis  that,  with 
appropriate  quantization  of  local  distortion  values,  it  Is  feasible 
to  perform  the  DTW  shortest  path  calculations  within  a  practical¬ 
sized  residue  number  system  (RNS).  However,  this  raises  the 
question  as  to  how  an  appropriate  quantization  of  these  values  Is  to 
be  obtained  without  first  leaving  RNS.  We  propose  a  two-phase  quan¬ 
tization,  first  from  the  range  of  the  RNS  to  the  range  of  a  single 
modulus,  and  then  to  a  single  bit.  In  this  section,  we  show  how  to 
perform  the  first  quantization  In  RNS. 

At  first  glance,  quantization  of  values  within  RNS  appears  to 
be  a  formidable  problem  requiring,  in  effect,  a  series  of  magnitude 
comparisons  or  sign  determinations.  In  fact,  however,  the  problem 
may  be  greatly  simplified  because,  at  least  in  some  applications, 
there  Is  some  tolerance  for  error.  Quantization  divides  the  range 
of  an  observed  variable  Into,  say,  k  intervals,  and  replaces  each 
value  by  the  index  of  the  Interval  within  which  It  lies.  The  break¬ 
point  dividing  Interval  1  from  Interval  1  +  1  is  somewhat  arbitrary, 
and  it  Is  expected  that,  at  least  In  our  speech  recognition 
application,  very  little  harm  will  come  from  errors  made  In  the 
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neighborhood  of  the  breakpoint  which  cause  a  value  lying  in  the  1-th 
Interval  to  be  erroneously  recorded  as  lying  In  the  (1  +  l)-st 
Interval  or  vice  versa.  Based  upon  this  tolerance  of  errors  in  the 
breakpoint  neighborhood,  we  have  devised  a  method  for  quantizing  the 
values  in  the  range  of  an  RNS  into  the  range  of  a  single  modulus 
ra^,  in  effect  scaling  by  =  M/m^,  where  M  =  mjm2...mu  is 
the  product  of  the  n  moduli  mj,  assumed  to  be  relatively  prime  in 
pairs,  which  comprise  the  RNS.  The  method  can  be  employed  for  all 
sets  of  relatively  prime  moduli,  but  involves  some  calculation. 

The  remainder  of  this  section  is  divided  into  three  parts 
dealing,  respectively,  with  the  quantization  function  t(x)  ,  the 
calculations  required  to  evaluate  t(x),  and  the  choice  of  moduli. 

3.4.2  The  Quantization  Function  t(x) 

Any  integer  x  can  be  represented  in  a  residue  number  system  by 

n 

X  =  y  raj  |x/m<|-  -  MA(x)  (3.9) 

i  =  l  i 


where  Ixl^,  =  x  -  m[x/m] ,  [y]  denotes  the  greatest  integer  con¬ 
tained  in  y,  and  A(x)  is  an  integer-valued  function  first  studied  by 
Aiken  and  Semon  [16]  and  whose  range  is  discussed  in  [17],  Appendix 
A.  Suppose  we  divide  our  moduli  into  two  groups  which  we  call 
p  s  and  q  s ,  where  p^  —  m^ ,  P2  —  m2 ,  q^  —  m^ ,  q2  —  m^ ,  . . . , 

9n-2  =  “n.  and  define  P  =  Pi P2 ,  Q  =  qiq2***9n-2>  and 

qi  “  Q/qi*  Consider  now  the  two  RNS  defined  by  the  p's  and 

q's.  We  can  express  x  in  the  first  system  as 


X  =  P2lx/P2lp^  +  Pilx/Pilp2  -  PAp( 


and  -X  in  the  second  system  as 

-X  =  qil-x/qilqj^  -  QAq(-x). 

Let  a(x)  =  P2lx/p2lp^  +  Pilx/pjp^  +  qil-x/q^l 

Then,  by  adding  (3.10)  and  (3.11) 

o(x)  =•  PAp(x)  +  QAq(-x) 

Define  Q  =  |-1/Q|p 


Note  that  Sj  and  ^^2  can  be  calculated  within  the  RNS.  Finally,  let 

t(x)  =  ls2(x)  -  Si(x)tp^.  (3.17) 

What  will  the  function  t(x)  look  like? 

First  consider  the  function  Aq(-x).  We  have  A(kM  +  x)  = 

A(x)  -  k  (see  reference  [18]),  and  for  0  <  x  <  M  we  have 
0  _<  A(x)  <  n,  the  number  of  moduli  (see  reference  [17],  Appendix 
A).  Suppose  first  that  the  q-set  consists  of  a  single  modulus. 

Then  -Aq(-x)  =  -1  for  0  <  x  Q,  -Aq(~x)  =  -2  for  Q  <  x  £  2Q, 
and  so  forth.  The  quantities  s^ (x)  and  S2(x)  remain  constant  over 
any  span  kQ  <  x  _<  (k  +  1)Q;  both  decrease  by  one  unit  at  the 
transition  of  x  from  the  value  (k  +  l)Q  to  (k  +  1)Q  +  1.  However, 
their  difference,  modulo  p^,  does  not  change  except  at  every  P2”th 
such  transition.  The  first  such  change  occurs  for  x  =  p2  •  Q  +  1, 
when  -Aq(-x)  becomes  -(p2  +1).  A  similar  change  will  occur  every 
P2Q  values  thereafter.  Thus,  when  the  q-set  consists  of  a  single 
modulus,  t(x)  takes  on  the  pj  values  0,  I,  ...,  p^  -  1  in  some  order 
In  blocks  of  length  P2  •  Q  =  mj . 

Next,  consider  the  case  where  the  q's  consist  of  two  moduli,  q^ 
and  ^2 •  Then  -Aq(-x)  takes  on  the  two  values  -1  and  -2  In  the 
range  0  <  x  £  Q,  the  two  values  -2  and  -3  in  the  range  Q  <  x  <  2Q, 
and  so  forth.  The  quantities  s^  and  S2  can  vary  by  one  unit  over 
the  range  kQ  £  x  <  (k  +  1)Q,  but  their  difference  Is  normally 
constant  modulo  p^ .  Both  decrease  by  one  unit  at  the  transition 
points  but,  again,  the  difference  modulo  pj  Is  unchanged  except  at 
every  P2“th  transition,  and  at  certain  values  In  the  last  sub-block 
of  length  Q  before  such  a  transition,  namely,  those  values 


X  +  (kp2  -  1)Q  for  0  <  X  ^  Q  for  which  Aq{-x)  =  2  or,  equiva¬ 
lently,  those  values  for  0  ^  x  <  Q  for  which  Aq(x)  =  1.  For  these 
values  t(x)  turns  too  soon,  creating  an  ambiguous  neighborhood  at 
the  transition  point.  The  length  of  this  ambiguous  neighborhood  Is 
known  [17].  Let  j  be  the  unique  integer  In  [0,  Q  -  1]  satisfying 
l-j/qilqj^  =  1  for  1  =  1,  ...,  n  -  2.  Then  the  ambiguous 
neighborhood  has  size  j  +  1. 

Example  1 ;  p^  =  4,  p2  =  9,  qi  =  5,  q2  =  7 

Then  j  =  23  and  we  should  find  four  blocks  of  length  315,  with 
an  ambiguous  area  of  length  24  at  the  end  of  each  block.  See  table 

3.3. 

Table  3.3 

t(x)  for  4,  9,  5,  7  Residue  Number  System 


X 

t(x) 

1 

291 

1 

292 

- 

315 

1  or  2 

316 

- 

606 

2 

607 

- 

630 

2  or  3 

631 

- 

921 

3 

922 

- 

945 

3  or  0 

946 

- 

1236 

0 

1237 

- 

1260 

0  or  1 

The  range  of  the  system  above  1236  should  not  be  used,  as 
errors  which  are  made  here  would  map  a  value  which  belongs  In  the 
last  Interval  Into  the  first  Interval.  Other  errors  are  presumed  to 
be  relatively  harmless.  The  size  of  the  ambiguous  neighborhood 
depends  only  on  the  q's. 

Finally,  consider  the  case  where  there  are  more  than  two  q's. 

In  this  case,  Aq(x)  may  take  on  values  greater  than  1  In  the  In¬ 
terval  0  <  X  <  Q.  If  k  Is  the  largest  value  of  Aq(x)  In  [0, 

Q  -  1] ,  and  j  Is  defined  as  before,  then  the  ambiguous  neighborhood 
has  size  (k  -  1)Q  +  j.  The  size  attainable  by  k  Is  treated  further 
In  (17J. 

Example  2;  pj  =  7,  p2  =  11,  q^  =  2,  q2  =  3,  q^  =  5 

For  this  example  k  *  1  and  j  =  29.  We  expect  blocks  of  length 
330,  with  an  ambiguous  range  of  length  30  at  the  end  of  each  block. 
See  table  3.4 


Table  3.4 


t(x)  for  7,  11,  2,  3,  5  Residue  Number  System 


X 


t(x) 


1 

- 

300 

4 

301 

- 

330 

4 

or 

1 

331 

- 

630 

1 

631 

- 

660 

1 

or 

5 

661 

- 

960 

5 

961 

- 

990 

5 

or 

2 

991 

- 

1290 

2 

1291 

- 

1320 

2 

or 

6 

1321 

- 

1620 

6 

1621 

- 

1650 

6 

or 

3 

1651 

- 

1950 

3 

1951 

- 

1980 

3 

or 

0 

1981 

- 

2280 

0 

2281 

- 

2310 

0 

or 

4 

Example  2  (continued): 

The  range  above  2280  should  not  be  used,  as  an  error  occurring 
here  would  map  a  value  belonging  to  the  last  Interval  into  the 
first.  Notice  that  the  quantized  values,  t(x),  of  table  3.4  are 
Inappropriately  ordered. 

Since  from  (3.12)  a(0)  =  0  (cf.  equations  3.15  to  3.17),  we 
have  t(0)  =  0,  and,  since  there  is  a  transition  when  x  passes  from  0 
to  1,  t(l)  *  0.  Using  this  fact,  we  can  transform  the  t(x)  values 
Into  an  ordered  sequence  (1,  2,  ...,  pj  -  I,  0)  by  a  premultipli- 
catlon  by  |t(l)“Mpj-  For  example  2,  we  have  t(l)  =  4  and 
lt(l)”Mpj  =■  II/4I7  *  2,  resulting  in  table  3.5  in  place  of  table 
3.4: 
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3.4.3 


Calculations 


Calculation  of  Sj (x)  requires  a  simple  table  lookup  for  n  -  1 
quantities  which  are  summed  In  a  modulo  pi  adder;  calculation  of 
S2 (x)  requires  a  simple  table  lookup  for  n  -  1  quantities  which  are 
summed  In  a  modulo  P2  adder.  Calculation  of  t(x)  requires  one 
additional  summation  (subtraction)  in  the  modulo  pj  adder. 

From  (3.12)  (3.14)  and  (3. 15)  we  have 

n-2  _ 

si(x)  =  |qa(x)|p^  =  jlqlpjxlp^  +  iQ^l  Ipj  l-’^/qi  Iq^lp^ 
n-2 

=  llQ’^lpi  +  l-l/^llpi  I'x/^llqilpjp^ 


Similarly, 


n-2 

S2(x)  =  |lQxlp  +  y  I  I'l/'lllpol-x/qilq.  L  I  • 

^  1=1  ^  4^2  '  P2 


(3.19 


The  quantities 


ai  -  ||-l/qilpj-x/qilqjp^ 


(3.20 


and 
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bi  =  ll-i/qilpgl-x/qilqil 


can  be  precomputed  for  the  values  Ixlq^^  and  stored  in 
(hard-wired)  tables. 

Example  3:  pj^  =  P2  “  5,  q^  =  7,  q2  =  9. 

Then  Q  =  13,  a^  =  llSxlyl^,  a2  =  UlSxIgl^,  bj  =  UlSxl^lg, 
b2  =  llSxIglg.  The  required  lookup  tables  are  Illustrated  in 
table  3.6 


Table  3.6 


Tables  for  Quantization  in  4,  5,  7,  9  RNS 


0 

0 

0 

0 

0 

3 

3 

1 

0 

3 

2 

3 

2 

1 

1 

2 

2 

4 

1 

4 

1 

2 

0 

2 

2 

1 

1 

2 

2 

- 

Im:. 


The  last  column  provides  a  table  for  ISxlg,  needed  for  computing 
S2(x).  Since  |Q|^  =  1,  no  table  Is  needed  In  this  example  to  obtain 
|Qx|4  =  Ixl^.  These  tables  should  be  easy  to  Implement.  For 
example.  If  |x|g  Is  given  In  binary  form,  the  b2  value  can  be 
obtained  In  this  example  simply  by  dropping  the  last  bit. 

Figure  3.9  Is  a  schematic  block  diagram  depicting  the 
calculation  of  t(x)  from  the  n  residues  |x|p^,  Ixlp^,  |x|q  , 

...,  Ixlqjj_2*  quantities  iQxlp^,  iQxlp^,  aj^,  and  b^ 

(k  =  1,  ...,  n  -  2)  are  obtained  from  the  residues  by  hard-wired 
table  lookups  and  fed  to  the  two  modular  adders,  producing  the  quan¬ 
tities  Sj (x)  and  S2(x),  whose  difference,  modulo  pj,  defines  t(x). 
The  modular  adders  In  figure  3.9  are  assumed  to  be  (n  -  l)-lnput 
adders.  If  only  two-input  adders  are  available,  the  summations 
producing  t(x)  require  n  -  1  stages,  as  follows: 

Step  1:  Add  iQxlp^  and  a^  In  the  modulo  pj  adder; 

add  iQxlp^  and  bj  In  the  modulo  P2  adder. 

Step  k:  Add  a;^  to  the  contents  of  the  modulo  pj 

adder;  add  b;^  to  the  contents  of  the 
modulo  P2  adder. 

Step  n-2:  Add  afj-2  to  the  contents  of  the  modulo  pj 
adder  and  output  the  result,  Sj(x);  add 
bjj-2  to  the  contents  of  the  modulo  P2 
adder  and  output  the  result,  S2(x). 

Step  n-1:  Subtract  Sj (x)  from  S2(x)  In  a  modulo  pj 
adder,  producing  the  desired  result  t(x). 
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Quantization  by  t(x)  produces  equally  spaced  Intervals.  If 
unequal  intervals  are  desired,  some  additional  computations  may  be 
required  to  map  the  t(x)  values  into  a  reduced  set.  Modular  reduc¬ 
tion  of  t(x)  may  also  be  required  to  obtain  its  modular  representa¬ 
tion,  unless  p^  is  the  smallest  of  the  n  moduli  m^. 

3.4.4.  Choice  of  Moduli 


From  a  given  set  of  n  relatively  prime  moduli  m^^  comprising 
an  RNS,  any  given  single  modulus  can  be  chosen  as  pj ,  and  any 
remaining  modulus  can  be  chosen  as  P2 .  The  designation  of  p^  and  P2 
determines  the  size  of  the  modular  adders  required  in  the  computa¬ 
tion  of  t(x)  but,  since  these  adders  are  required  in  any  event,  it 
should  not  influence  the  choice  of  pj  and  P2 •  The  selection  of  pj 
is  primarily  influenced  by  the  particular  quantization  desired; 
the  larger  pi  is,  the  more  flexible  are  the  choices  for  the  ultimate 
quantization.  On  the  other  hand,  if  p^  is  the  smallest  of  the  n 
moduli,  it  may  be  possible  to  avoid  modular  reduction  of  the  quan¬ 
tized  values. 

The  size  of  the  ambiguous  range  in  each  block  is  determined  by 
the  choice  of  the  q  s.  It  is  less  than  (but  of  the  order  of) 

(n  -  3)Q,  whereas  the  block  size  itself  is  P2Q*  The  larger  the 
value  of  P2 ,  the  greater  the  error-free  portion  of  each  block.  It 
may,  therefore,  be  desirable  to  choose  P2  to  be  the  largest  of  the  n 
moduli,  but  this  choice  is  not  critical.  However,  P2  should 
probably  be  somewhat  larger  than  n  -  3. 


84 


3.4.5 


Conclusion 


A  method  has  been  gl’'en  for  providing  quantization  within  a 
residue  number  system  from  values  (almost)  anywhere  In  the  range  of 
the  system  Into  the  range  of  a  single  modulus.  An  alternative,  of 
course,  Is  to  perform  a  translation  Into  a  mixed  radix  number  system 
of  each  value  x  to  be  quantized.  This  calls  for  a  roughly  equiva¬ 
lent  amount  of  work  (n  -  1  table  lookups  and  subtractions),  but  does 
not  In  Itself  produce  the  desired  quantization,  requiring  a  further 
division  (or  equivalent  operation)  upon  the  translated  value,  and  a 
possible  reconversion  of  the  result  to  RNS.  It  Is  felt  that  there 
Is  a  definite  advantage  to  remaining  within  RNS  for  this  computa¬ 
tion.  This  rather  gross  quantization  appears  to  be  appropriate  for 
mapping  the  results  of  local  distortion  computations  Into  a  small 
range  suitable  for  performing  the  DTM  shortest  path  calculations 
within  RNS. 

The  method  can  easily  be  extended  with  the  addition  of  more 
modular  adders  to  allow  quantization  to  the  range  of  a  subset  of  the 
moduli  rather  than  to  the  range  of  a  single  modulus.  To  map  values 
from  range  M  to  range  ^  *  Pi  »  P2 »  Pk-1»  where  our  RNS 

moduli  have  been  divided  Into  two  sets  with  k  and  n-k  members, 
respectively,  we  replace  (3.10),  (3.11),  and  (3.12)  by 

k 

X  =  ^)!^PiU/pilp^  -  PAp(x)  (3.22) 

n-k 

-X  =  ^tl'x/qilq^  -  QAq(-x)  (3.23) 

k  ^  n-k 

=  .1,  Pllx/Pllp.  +  ,1,  qil-x/qilq4  (3.24) 


»  *  •  •  ^  k 

-.T- 


and  replace  and  P2  by  pj^  and  pj^,  respectively.  In  equations 
(3.15),  (3.16),  and  (3.17).  To  calculate  Sj (x)  and  t(x)  in  RNS,  we 
simply  get  a  modular  representation  in  terms  of  the  pj.  Thus,  we 
use  k-1  adders  to  calculate  s^ (x)  and  t(x),  and  one  more  to 
calculate  S2 (x) . 

3.5  RNS  IMPLEMENTATION  OF  THE  SHORTEST  PATH  ALGORITHM 

Although  a  somewhat  large  range  is  required  for  an  RNS 
implementation  of  the  Itakura-Salto  distortion  function  calculation 
(as  will  be  seen  from  the  discussion  in  section  3.6),  the  shortest 
path  calculation  Itself  can  be  successfully  performed  in  an  RNS  of 
much  smaller  range.  Two  types  of  potential  overflow  must  be 
considered.  First,  the  magnitude  comparisons  of  steps  3  and  4  of 
the  revised  shortest  path  computation  of  section  3.3  are  to  be 
performed  in  the  largest  residue  channel.  An  overflow  will  result 
If  the  difference,  in  absolute  value,  between  the  cumulative  path 
distances  under  comparison  exceeds  half  the  largest  modulus 
employed.  Second,  the  cumulative  distances  themselves  are 
represented  by  residues  in  all  channels  used.  An  overflow  results 
if  the  cumulative  distance  for  the  presumed  best  path  exceeds  the 
range  of  the  RNS.  This  is  a  serious  error,  generally  leading  to  a 
recognition  error,  for  the  resulting  path  score  will  be  much  lower 
than  its  true  value. 

We  consider  first  this  latter  overflow  possibility.  Under 
weight  function  3  of  equation  (3.2),  the  cumulative  cost  for  any 
path  cannot  exceed  the  number  of  reference  frames.  While  alterna¬ 
tive  weight  functions  can  give  higher  costs,  the  most  expensive 
weight  function  we  have  employed  results  In  paths  whose  cumulative 
cost  cannot  exceed  the  sum  of  the  number  of  test  frames  and  the 
number  of  reference  frames.  Since  the  longest  utterance  in  our  test 


library  consists  of  72  frames,  an  RNS  of  range  144  or  greater 
suffices  for  the  shortest  path  computation  (under  one-bit  quantiza¬ 
tion  of  distortion  values).  In  a  typical  simulation  run  of  sixty 
test  cases  against  the  entire  library,  the  largest  path  score 
produced  was  46,  resulting  from  matching  a  72-frame  "four"  and  a 
70-frame  "five." 

We  plan  to  employ  the  two  largest  moduli  from  the  set  used  for 
the  distortion  calculations  as  an  RNS  for  the  shortest  path  calcula¬ 
tions.  For  most  of  our  simulations  this  has  been  the  pair  of  primes 
73  and  71,  providing  a  range  much  greater  than  needed  for  this 
computation. 

The  first  type  of  overflow,  though  less  harmful.  Is  much  more 
likely  to  occur,  and  care  should  be  taken  to  ensure  that  the  first 
residue  channel  Is  of  sufficient  size.  Table  3.7  shows  the  number 
of  occurrences  of  errors  of  the  first  type  for  various  choices  of 
RNS2  for  the  DTW  path  computations.  In  all  cases  shown,  the  corre¬ 
lation  and  Itakura-Salto  distortion  function  computations  were 
performed  using  an  RNSl  composed  of  the  five  prime  moduli 
{73,71,67,61,59},  and  quantization  of  the  distortion  values  was 
performed  in  two  stages,  first  to  the  range  (0,72),  and  then  to  a 
single  bit  (match  or  no-match)  using  a  threshold  value  T  =  16.  The 
last  column  of  the  table  shows  the  recognition  error  rate  for 
simulations  performed  using  our  sixty-word  test  set  consisting  of 
different  productions  of  the  ten  digits  and  "oh"  (The  setting  of  the 
global  constraint  always  caused  one  error  (1.7  percent)  among  the 
sixty  test  cases). 


For  the  last  two  RNS  {7,5}  and  {5,3}  the  RNS2  range  Is  exceeded 
much  of  the  time  by  the  cumulative  distances.  For  the  remaining 
cases,  the  range  Is  never  exceeded  by  the  cumulative  distances.  The 
results  displayed  In  table  3.7  support  the  hypothesis  that  little 
harm  results  from  occasionally  overflowing  the  largest  modulus  In 
the  path  comparisons,  provided  that  the  number  of  overflows  Is  not 
excessive.  No  degradation  In  recognition  performance  was  observed 
until  the  largest  modulus  was  reduced  to  11,  when  the  number  of 
overflows  exceeded  12000.  No  overflows  occur  when  the  largest 
modulus  Is  25  or  greater. 

Table  3.7 


Number  of  Overflows  of  First  Modulus  and  Recognition  Error  Rates 
for  Various  RNS2  Choices  for  DTW  Calculations  -  One-Bit 
Quantization  of  Distortion  Values  -  RNSl  -  {73,71,67,61,59}; 

Threshold  ■  16 


Moduli  Number  of  Overflows  Recognition  Error  Rate 


25,23 

0 

1.7 

23,21 

8 

1.7 

21,19 

44 

1.7 

19,17 

314 

1.7 

17,15 

364 

1.7 

15,13 

976 

1.7 

13,11 

4557 

1.7 

11,9 

12390 

6.7 

9,7 

38323 

25.0 

7.5 

110438 

61.7 

5,3 

294523 

95.0 

3.6  RNS  RANGE  AND  SCALING 

The  Itakura-Salto  distortion  measure  calculation  has  the  form 

djs  =  scalar  +  (3.25) 

n 

The  range  of  the  RNS  to  be  employed  must  be  sufficient  to  contain 
the  end-product  of  this  calculation.  Occasional  overflow  of  the  RNS 
by  the  calculated  distortion  function  would  probably  not  cause  much 
harm;  frequent  overflow  could  affect  recognition  accuracy  adversely; 
constant  overflow  would  destroy  the  Information  contained  In  the 
distortion  measure  and  render  It  useless  for  speech  recognition. 

We  assume  that  the  test  correlation  coefficients  r^  are  com¬ 
puted  In  RNS.  It  la  possible  to  perform  some  scaling  down  of  the 
test  utterance  sample  values  before  calculation  of  the  r^.  The 
reference  Inverse  correlation  coefficients  are  normally  small  frac¬ 
tional  values,  and  must  be  scaled  up  before  conversion  to  Integer 
values  and  RNS  representation.  Since  this  conversion  will  entail 
truncation  error.  It  must  be  performed  with  some  care. 

If  the  test  input  sample  values  are  scaled  by  the  factor  2”'^, 
then  the  values  are  scaled  by  2~2k.  if  then  the  reference 

inverse  correlation  coefficients  are  scaled  up  by  a  factor  2^,  the 
"scalar"  must  be  scaled  up  by  a  factor  2^”^^.  The  end  result  Is 
that  the  distortion  function  is  scaled  up  by  For  example. 

If  we  scale  the  test  Input  sample  values  by  2~^  and  the  reference 
inverse  correlation  coefficients  by  2^^,  the  result  is  to  scale  the 
distortion  function  by  2  .  Since  we  have  seen  (figure  3.7) 

unsealed  Itakura-Salto  distortions  of  the  order  of  2^,  we  would 
expect  to  need  an  RNS  range  of  about  2^®  to  contain  this  calcula¬ 
tion. 
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In  the  RNS  simulations  reported  In  section  3.7,  twelve-bit 
Input  sample  values  were  used  for  the  test  utterances.  After 
windowing  (Hamming  window)  and  RMS  normalization,  these  were  recon¬ 
verted  to  Integer  values  and  clipped  at  ±2047.  Before  conversion  to 
RNS,  these  values  can  be  scaled  down  a  little,  but  not  much  or  they 
will  Incur  truncation  errors  large  relative  to  their  size. 

We  looked  at  the  unsealed  Inverse  correlation  coefficients. 

The  largest  values  (for  12-blt  Inputs)  tend  to  lie  In  the  range  10“® 
to  10“^.  These  must  be  scaled  up  before  conversion  to  Integers  and 
RNS.  We  would  probably  like  to  scale  these  up  by  something  like  10® 
to  10®  to  make  them  comparable  In  size  to  the  unsealed  test  correla¬ 
tion  coefficients,  but  must  take  care  that  the  resulting  scaled 
distortion  values  not  exceed  the  RNS  range  more  than  occasionally. 

In  the  next  section,  we  show  simulation  results  employing  various 
scalings  for  the  five-modulus  RNS  (73,  71,  67,  61,  59),  with  range 
approximately  1.25  «  10®,  or  about  1.16  x  2®®. 

3.7  SIMULATION  RESULTS  FOR  IMPLEMENTATION  OF  DTW  ALGORITHM  IN  RNS 

3.7.1  Simulation  Test  Set 

In  this  section  we  describe  the  results  of  simulations  to 
analyze  the  performance  of  an  RNS-based  DTW  speech  recognition 
system,  such  as  that  Illustrated  In  figure  3.8.  In  all  simulations, 
the  Itakura-Salto  distortion  metric  was  used.  All  simulations  were 
performed  using  a  limited-size  single-speaker  library  consisting  of 
sixty  utterances  from  an  eleven-word  vocabulary  (the  ten  digits  0  - 
9  and  "oh”).  The  same  set  of  sixty  utterances  was  used  as  the  test 
set.  Scoring  a  success  required  that  both  the  best  and  second  best 


guesses  be  the  correct  text.  Tie  scores,  which  would  result  In 
ambiguity,  were  counted  as  failures.  The  sixty  test  cases  always 
resulted  in  at  least  one  failure  which  was  caused  by  the  setting  of 
the  global  constraint  and  not  by  the  RNS  Implementation  or  the  quan¬ 
tization  of  Itakura-Saito  distortion  values.  A  conventional  imple¬ 
mentation,  running  with  full  range  distortion  values,  would  have 
made  the  same  error,  as  the  global  constraint  eliminated  from 
consideration  all  reference  patterns,  other  than  that  Identical  to 
the  test,  with  same  text  value  In  this  one  case.  For  most  of  the 
simulations  the  RNS  (73,  71,  67,  61,  59),  with  range  approximately 
2  ,  was  employed.  In  the  remaining  parts  of  this  section,  the 

effects  of  Input  scaling,  distortion  function  scaling,  quantization 
threshold,  and  RNS  range  upon  recognition  error  rate  for  this  simu¬ 
lation  are  described. 

3.7.2  Effect  of  Input  Scaling  on  Recognition  Error  Rate 


Figure  3.10  Is  a  plot  of  the  recognition  error  rate  versus  the 
Input  scaling  employed  for  a  fixed  Itakura-Saito  distortion  function 
scaling  by  2  .  (Of  course,  as  the  Input  scaling  changes,  the 

Inverse  correlation  coefficient  scaling  is  also  changed  In  a  comple¬ 
mentary  way  to  keep  the  distortion  scaling  constant.)  Input  scaling 
appears  to  have  little  effect  on  error  rate  until  the  scale  factor 
reaches  2“**.  Test  Input  values  In  the  12-blt  range  probably  should 
not  be  scaled  down  by  more  than  2  . 
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Effect  of  Distortion  Function  Scaling  on  Recognition  Error 
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Figure  3.11  Is  a  plot  of  the  recognition  error  rate  versus  the 
Itakura-Salto  distortion  function  scaling  employed  for  a  fixed  input 
scaling  by  2“^.  (Of  course,  as  the  distortion  function  scaling 
changes,  the  inverse  correlation  coefficient  scaling  is  also  changed 
in  a  complementary  way  to  keep  the  input  scaling  constant.)  Not 
shown  is  the  90  percent  error  rate  obtained  for  a  distortion  func¬ 
tion  scaling  by  2^^.  The  high  error  rates  obtained  for  the  cases 
2  and  2  reflect  insufficient  scaling  of  the  reference  Inverse 
correlation  coefficients;  the  high  error  rate  obtained  for  a  dlstor- 
tlon  function  scaling  by  2  reflects  overflow  of  the  RNS  by  the 
distortion  calculation. 

Another  view  is  presented  in  figure  3.12,  which  shows  the 
recognition  error  rates  obtained  for  various  combinations  of  test 
input  scaling,  distortion  function  scaling,  and  inverse  correlation 
coefficient  scaling  (any  two  of  which  may  be  set  independently). 
Acceptable  performance  was  realized  for  certain  combinations  result- 
Ing  in  a  distortion  function  scaling  of  1  or  2  .  It  is  expected 

that  Improved  performance  would  result  if  the  RNS  range  and  distor¬ 
tion  function  scaling  were  both  increased  together. 

3. 7. A  Effect  of  Quantization  Threshold  on  Recognition  Error  Rate 


Figure  3.13  shows  the  effect  of  the  choice  of  the  quantization 
threshold  employed  for  the  second  quantization,  i.e.,  from  the  range 
of  the  modulus  73  to  a  single  bit,  upon  recognition  error  rate.  For 
these  simulations  the  input  values  were  scaled  by  2  and  the 

2  Q 

distortion  function  was  scaled  by  2  .  A  threshold  of  16  gave  the 


best  performance,  but  over  a  fair  range  the  error  rate  does  not 
appear  to  be  especially  sensitive  to  this  choice.  (The  results 
contained  in  figure  3.12  were  all  obtained  using  a  threshold  of  12.) 

3.7.5  Effect  of  RNS  Range  on  Recognition  Error  Rate 


I 


Figure  3.14  shows  the  effect  of  RNS  range  upon  recognition 
error  rate,  with  a  constant  scaling  of  the  Itakura-Saito  distortion 
function  values  by  2^®.  Test  input  values  were  scaled  by  2~®  and  a 
quantization  threshold  of  16  was  used.  Three  different  residue 
number  systems  were  employed:  (73,  71,  67,  31,  29),  with  a  range  of 
approximately  2^®;  (73,  71,  67,  31,  59),  with  a  range  of  approxi¬ 
mately  2^®;  and  (73,  71,  67,  61,  59),  with  a  range  of  approximately 
2®®.  The  first  RNS  gave  an  error  rate  of  100  percent,  the  second  an 
error  rate  of  about  40  percent,  and  the  third  an  error  rate  of  1.7 
percent.  Clearly,  the  range  is  Important.  The  RNS  range  must  be 
sufficient  to  contain  the  scaled  distortion  function  values  most  of 
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SECTION  4 

SYSTOLIC  ARCHITECTURE  FOR  RNS  IMPLEMENTATION 


This  section  discusses  the  Implementation  of  the  autocorrela¬ 
tion  sample  computation  and  dynamic  time-warping  (DTW)  algorithm. 

For  the  former,  a  linear  systolic  array  is  presented  which  allows 
pipelined  computation  of  the  autocorrelation  coefficients,  while 
accepting  a  continuous  flow  of  input  speech  samples,  without 
requiring  the  insertion  of  zeros  between  adjacent  frames. 

For  the  DTM  algorithm,  a  two-dimensional  pipelined  systolic 
array  of  processing  cells  is  discussed.  Operation  is  pipelined  for 
both  the  distortion  value  and  the  path  metric  computation,  yielding 
the  score  for  a  pair  of  test  and  reference  utterances  at  every  step 
of  the  operation. 

Both  arrays  are  well-suited  for  RNS  implementation,  as  will  be 
discussed. 

4.1  AUTOCORRELATION  COMPUTATION 

Let  X  *  {xj,,  m  >  0|  be  a  set  of  equally  spaced  and  appropri¬ 
ately  windowed  samples  representing  the  speech  signal.  We  consider 
a  finite  portion  of  the  signal  corresponding  to  a  test  utterance  and 
partition  it  into  overlapping  segments  of  M  samples  each  as  shown  in 
figure  4.1.  The  d-th  segment  is  denoted  x(*')  **  {x^^\o  £  m  £M-l}, 
and  the  shift  between  segments  is  denoted  by  A. 
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From  each  segment  a  vector  r(^)  =  (rp^\  •••»  o£  P+1 

autocorrelation  coefficients  given  by 

=  I  54^^4+n  (^-1) 

m=0 

Is  to  be  extracted. 

The  values  of  M  and  A  that  we  Implement  are  180  and  80  samples 
respectively.  With  these  parameters,  at  most  three  segments  overlap 
at  any  time,  and  hence  three  correlators  will  be  needed  if  the 
correlation  vectors  are  to  be  computed  in  a  pipeline. 

Figure  4.2  shows  three  correlators  which  switch  the  Input  data 
stream  on  and  off  precisely  at  the  beginning  and  end  of  their  re¬ 
spective  segments.  Thus,  r(^)  Is  computed  by  correlator  1,  r^^) 
by  correlator  2  and  by  correlator  3.  When  the  first  sample 

(4) 

of  segment  4,  xg  =  X3^,  appears,  all  samples  of  segment  1  have 
entered  correlator  1,  so  It  Is  ready  to  receive  segment  4.  Care  has 
to  be  taken  so  that  In  correlator  1  samples  from  segment  4  do  not 
mix  with  those  of  segment  1.  This  can  be  accomplished  by  the  linear 


4.2  LINEAR  SYSTOLIC  ARRAY  FOR  AUTOCORRELATION  COMPUTATION 


Figure  4.3  shows  a  linear  systolic  array  composed  of  P  +  1  pro¬ 
cessing  cells  and  P  +  1  delay  elements  forming  the  output  register. 
The  Input  samples  enter  Into  the  leftmost  and  rightmost  cells  and 
are  passed  along  to  the  next  cell  to  the  right  and  to  the  left, 
respectively.  At  each  step  every  cell  forms  the  product  of  its  two 
Inputs  and  adds  It  to  Its  contents.  After  all  180  samples  have  been 
operated  on,  the  autocorrelation  coefficients  will  be  In  the  cell 
registers  and  at  that  point  they  can  be  passed  down  to  the  output 
register  and  circulated  out  to  the  right. 

A  detailed  example  with  5  cells  (P  =  4)  is  developed  at  succes¬ 
sive  times  In  figure  4.4.  Input  samples  are  interleaved  with  zeros, 
and  the  left  Input  enters  the  array  first,  from  the  left,  so  that  it 
meets  the  first  sample  from  the  right  Input  at  cell  0.  The  figure 
shows  the  state  of  the  array  at  consecutive  time  instants.  As  the 

signals  progress  through  the  array,  sums  of  products 

(0  ) 

accumulate  in  each  cell,  with  the  computation  of  r^  taking  place 
in  cell  n,  0  <  n  P. 

Figure  4.5  shows  the  last  few  samples  of  a  segment,  followed  by 
zeros.  At  the  end  of  the  process,  the  n-th  autocorrelation  coeffi¬ 
cient  resides  in  cell  n,  0  <  n  <  P,  at  which  point  it  can  be  fed  out 
serially  in  the  manner  already  described,  or  in  parallel  (tech¬ 
nically  violating  the  systoliclty  of  the  operation),  if  needed. 
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In  pipelined  operation,  samples  of  a  new  segment  may  (depending 
on  the  values  of  M  and  A)  enter  the  array  before  the  computation  of 
the  present  segment's  autocorrelation  vector  has  been  completed,  and 
a  cell  may  receive  Inputs  from  both  segments  at  the  same  time,  in 
which  case  they  should  be  Ignored.  This  can  be  taken  care  of  by 
attaching  a  control  bit  to  the  first  sample  of  the  left  input  seg¬ 
ment  and  to  the  first  sample  of  the  right  input  segment.  The  left 
control  bit  appears  at  any  given  cell  when  the  last  term  in  the  sum 
is  being  computed  and  it  Instructs  the  cell  to  Ignore  subsequent 
Inputs,  both  from  the  left  and  from  the  right,  until  the  control  bit 
in  the  first  sample  of  the  new  segment  coming  from  the  right 
appears,  instructing  the  cell  to  output  its  contents,  clear  its 
accumulator  and  resume  operations,  as  the  computation  of  the  corre¬ 
sponding  coefficient  of  the  new  segment  begins. 

Figure  4.6  Illustrates  the  Interplay  of  the  two  control  bits. 
The  old  segment  is  labeled  Xg ,  x^,  ...,  X}^-^,  and  the  new  segment 
yg ,  y|,  ...,  yfi-i .  The  samples  carrying  control  bits  are 
circled.  The  arrows  emerging  from  the  bottom  of  the  cells  indicate 
that  the  result  has  become  available. 

The  equations  governing  the  operation  of  each  cell  are  shown  in 
figure  4.7.  The  left  input  is  indicated  by  a  subscript  1  and  the 
right  input  by  a  subscript  2;  the  time  index  is  n  and  is  shown  in 
parentheses.  The  corresponding  control  bits  are  Cj (n)  and  C2(n). 

The  quantities  u(n)  and  s(n)  are  state  variables.  The  variable  u(n) 
is  used  in  the  operation  of  the  control  bits;  it  is  initially  a  zero 
and  it  changes  its  value  every  time  a  control  bit  appears.  The 
result  is  that  u(n)  =  1  precisely  when  the  cell  is  computing,  and 
u(n)  =  0  between  the  computations  of  consecutive  segments,  when  the 
cell  is  Ignoring  its  Inputs.  The  variable  s(n)  stores  the  partial 
suras  and  r(n)  is  the  output. 


4.2.1 


RNS  Hardware  Concept 


Appendix  A  discusses  the  fundamentals  of  residue  number 
systems. 

Consider  Implementing  equation  (4.1)  In  an  RNS  with  prime 
moduli  Pi  ,  P2 ,  •••,  Pji»  and  range  M.  First,  the  Input  must  be 
converted  to  residue  form.  Then,  for  each  k,  equation  (4.1)  Is 
computed  modulo  p^.  This  requires  one  set  of  correlators  like  the 
one  In  figure  4.2  for  each  modulus.  The  output,  the  residues  of  the 
autocorrelation  coefficients,  are  then  used  for  the  computation  of 
the  local  distortion. 

The  configuration  of  the  correlator  using  I  residue  channels  Is 
shown  In  figure  4.8.  For  each  channel,  a  copy  of  the  control  bits 
Cl  and  C2  must  accompany  the  Inputs. 

Figure  4.9  shows  the  structure  of  each  correlator  cell.  Each 
cell  uses  one  mod  p  multiplier  and  one  mod  p  adder,  six  data  regis¬ 
ters,  a  few  gates  and  switches  and  some  control  logic  which  controls 
the  output  flow. 
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The  detailed  designs  of  the  modulo  p  adders  and  multipliers,  as 
well  as  the  blnary-to-resldue  converter,  were  carried  out  by  MITRE's 
Integrated  Electronics  project  to  Implement  a  transversal  equal¬ 
izer.  They  developed  a  logarithmic  mod  p  (p  Is  prime)  multiplier. 

To  multiply  two  residues  a  and  b  modulo  p,  their  logarithms  base  a 
(where  a  Is  another  fixed  element  of  the  field)  are  computed.  The 
two  quantities  are  added  modulo  p  and  the  Inverse  logarithm  of  the 
result  Is  then  computed.  Symbolically, 


log^jAB  =  log(^A  +  log(jB. 


Figure  4.10  shows  a  block  diagram  of  the  multiplier  that  was 
developed  under  the  Integrated  Electronics  project. 

A  great  savings  Is  realized  If  the  logarithms  of  the  Input 
signals  are  computed  before  they  enter  the  array,  for  then  only  the 
Inverse  logarithm  needs  to  be  computed  In  each  cell.  This  Idea  was 
used  by  MITRE's  Integrated  Electronics  project  to  Implement  a  trans¬ 
versal  filter. 


To  select  the  moduli  we  assume  a  frame  length  of  180  samples, 
an  LPC  model  of  13  poles  and  no  zeros,  and  Input  speech  samples  In 
the  range  [-2°,2  ].  An  upper  bound  on  the  absolute  value  of  the 
size  of  the  autocorrelation  coefficients  Is  then  given  by  180(2®)^, 
or  a  range  of  about  [-2^**  ,2^“*  ] .  The  distortion  values  will  then  lie 
In  the  range  [-2®®, 2®®],  which  is  required  to  contain  the  dot 
product  of  two  feature  vectors.  However,  It  was  determined  by  simu¬ 
lation  that,  for  virtually  all  Inputs  occurring  In  practice,  the 
required  dynamic  range  Is  only  [-2®®, 2®®],  which  Is  spanned  by  the 
five  seven-bit  moduli  103,  107,  109,  113,  and  127. 


4.3  SYSTOLIC  \RRAY  KOR  DTW  COMPUTATIONS 

Once  the  correlation  vectors  for  the  test  segments  have  been 
computed  (and  the  Logarithm  of  the  process  gain,  If  the 
Itakura-Salto  distortion  function  Is  used),  the  DTW  computations  can 
be  carried  out  In  a  pipelined  two-dimensional  processing  array.  It 
is  assumed  that  the  Inverse-autocorrelation  coefficients  (  and 
logarithm  of  the  process  gain)  for  the  reference  segments  are 
available  In  a  stored  reference  library  and  need  no  computation 
during  the  speech  recognition  processing. 

The  DTW  computations,  as  noted  In  section  3,  separate  Into 
computation  of  the  distortion  d^j  between  the  Ith  test  segment  and 
jth  reference  segment  and  the  actual  path  metric  computations  asso¬ 
ciated  with  the  dynamic  programming  algorithm  to  find  the  metric  of 
the  shortest  path. 

The  local  distortion  computation  (for  the  Itakura-Salto  metric) 
Involves  formation  of  a  scalar  product  between  the  test  and  refer- 
■;nce  vectors  of  correlation  and  Inverse  correlation  values  with  the 
addition  of  a  constant  term  dependent  on  the  logarithmic  ratio  of 
the  process  gains,  followed  by  quantization  within  RNS  as  discussed 
In  section  3.  The  dynamic  programming  algorithm  uses  the  quantized 
illstortlon  values  In  a  decision-directed  algorithm  that  preserves 
only  the  best  path  metrics  at  each  Iterative  step.  As  long  as  the 
distortion  values  that  are  needed  In  the  path  metric  computations 
are  computed  In  advance  of  their  need,  both  the  distortion  and  path 
metric  computations  can  be  carried  out  In  the  same  systolic  array. 
Since  the  distortion  values  are  either  absorbed  Into  the  path  metric 
computations  or  discarded  as  they  are  used,  the  computation  can 
progress  through  the  array  as  a  wavefront  leaving  empty  cells  In  Its 
wake  to  provide  a  fully  pipelined  capability. 
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Below  we  describe  the  flow  of  these  computations  In  a  two- 
dimensional  array  of  only  a  few  cells  for  purposes  of  Illustration 
The  architecture  described  Is  readily  extrapolated  to  a  larger 
array. 


4.3.1 


Computation  of  the  Array  of  Distortion  Values 


Once  the  test  utterance  has  been  partitioned  Into  n  overlapping 
segments,  the  Jth  segment  being  represented  by  a  vector  £(j), 
l^j^nof  P+l  autocorrelation  coefficients  of  the  test  segment, 
and  the  reference  utterance  has  been  partitioned  into  m  segments 
represented  by  vectors  ^(1),  l£lj<m,  of  P+l  Inverse-autocor¬ 
relation  coefficients  of  reference  segments,  an  m  x  n  grid  Is  formed 
with  the  distortion  (cost)  dj^j  at  each  grid  point  (i,j)  being  the 
scalar  product  of  £(j)  and  u(i)  plus  a  constant  terra  dependent  on 
the  logarithm  of  the  ratio  of  the  process  gains  for  the  Itakura- 
Salto  distortion  function.  The  DTW  algorithm  computes  the  least 
cost  among  paths  between  grid  points  (1,1)  and  (ra,n),  the  cost  being 
the  sum  of  all  distortions  dj^j  encountered  along  the  path.  For 
the  Ttakura-Salto  distortion,  we  will  concentrate  first  on  the 
pipelined  calculation  of  the  scalar  product  of  the  autocorrelation 
vectors  and  will  show  later  how  to  Incorporate  the  constant  terra  In 
the  architecture. 


After  computation  of  a  distortion  value.  It  must  be  quantized 
to  a  lower  range.  The  details  of  such  a  quantization  were  discussed 
In  section  3.4.3. 


In  order  to  pipeline  the  distortion  computation,  the  correla- 

(Z)  (1) 

tlon  vectors  r  (j)  and  u  (1)  flow  Into  a  rectangular  array  of 
computational  cells  as  shown  Illustratively  In  figures  4.11  and 
4.12(a)  through  4.12(g),  where  a  time  sequence  of  data  flow  Is 
presented  for  a  square  array  of  nine  cells.  Figure  4.13  shows  the 
arrangement  of  the  various  distortion  functions  contained  In  the 
array  at  a  given  Instant  of  time;  the  superscripts  In  both  cases 
refer  to  the  collection  of  a  set  of  segments  associated  with  a 
particular  utterance. 

Since  the  path  computations  for  each  pair  of  test  and  reference 
utterances  will  proceed  as  a  wavefront  making  computations  on 
successive  diagonals,  the  deletion  of  previously  used  distortion 
values  allows  pipelining  to  the  extent  that  distortion  functions 
associated  with  a  number  of  utterances  equal  to  the  number  of 
diagonals  can  be  present  at  any  given  time,  with  the  path  computa¬ 
tions  being  pipelined  along  successive  diagonals. 

In  figures  4.11  through  4.13,  vectors  on  the  same  diagonal 
carry  the  same  superscript  and  correspond  to  the  same  test  or 
reference  utterance.  The  data  proceed  In  straight  lines  (test 
vectors  horizontally  and  reference  vectors  vertically)  through  the 
array,  and  at  each  time  Instant  each  cell  (i,j)  computes 

.  u^^\l)  =  J,  r^O(j)u^n(i) 

n“l 
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(a)  ( *■ ) 

the  scalar  product  of  r^  (j)  and  u  (1),  the  vectors  appearing  at 
Its  Inputs,  to  form  the  principal  term  in  the  Itakura-Saito 
distortion.  In  this  manner,  all  cells  on  any  given  diagonal  perform 
computations  on  the  same  pair  of  test  and  reference  utterances. 

Performing  the  distortion  computation  in  RNS  suggests  a  sepa¬ 
rate  grid  for  each  modulus.  The  Inputs  to  the  kth  array  are  the 
residues  modulo  pj^  of  the  feature  vectors,  u(l),  r(j),  from  which 
the  residues  modulo  pj^  of  the  local  distortions  dj^j  are  compu¬ 
ted.  In  this  operation  the  different  arrays  work  Independently  and 
in  parallel.  The  need  for  quantization,  however,  will  require 
communication  of  the  residue  values  prior  to  path  computations. 

Figure  4.14  is  a  simplified  block  diagram  of  the  computation  of 
the  local  distortion  in  a  residue  channel  modulo  p.  As  the 
components  of  the  test  and  reference  correlation  vectors  enter  the 
multiplier,  the  modulo  p  product  of  corresponding  components 
accumulates.  A  flag  bit  can  be  attached  to  Uq(1)  to  clear  the 
accumulator  at  the  beginning  of  every  new  segment. 

With  our  13-pole  model,  the  computation  of 

12 

u(i)  •  £(j)  =  I  Un(i)  rn(j)  mod  p  (4.2) 

n=0 

requires  13  multiplications  and  12  additions,  which,  if  performed  in 
a  pipeline,  require  13  +  12Tg  seconds,  where  and  Xg 

are  the  times,  in  seconds,  required  for  one  residue  multiplication 
and  one  residue  addition,  respectively. 


4. 3. 1.1  Quantization  of  the  Distortion  Values 

Next,  the  scalar  product  of  u(l)  and  £(j)  Is  quantized  to  a 

smaller  range  spanned  by  three  moduli.  Quantization  operates  on  all 

residues  of  the  scalar  product.  (A  block  diagram  of  this  operation 

Is  given  In  figure  4.15).  At  this  point,  the  constant  terms 
2  2 

logOf  and  logcfg  which  may  be  attached  to  Jt(j)  and  u(l) 
respectively  before  entering  the  array,  may  be  added,  converted  to 
the  three-modulus  RNS  and  added  to  the  quantized  version  of  u(l)  • 
r(j)  to  complete  the  computation  of  d^j^j. 

2 

The  computation  of  -logOf  can  be  performed  outside  RNS  In 

2 

parallel  with  the  computation  of  £(j).  The  term  logOg  can  be 
stored  with  u(l)  In  the  reference  library.  A  method  of  logarithm 
generation  Is  given  In  [10]. 

Quantization  Involves  all  the  residues  of  u(l)  •  £(j),  so 
communication  between  residue  channels  Is  required.  Each  residue  of 
u(l)  •  r(j)  Is  entered  Into  a  table,  so  In  our  five-modulus  RNS 

there  are  five  tables.  The  outputs  of  the  first  two  tables  are 

called  Qj  and  Q2 ,  and  lie  In  [0,pj-l]  and  [0,p2-l],  respectively. 
Each  of  the  other  three  tables  outputs  two  symbols,  aj^^,  b^^,  k  = 

1,  2,  3,  where  a]^  Is  In  [0,pj-l]  and  b]^  Is  In  [0,p,-l].  Then 

the  quantities 

3 

Si  =  Qi  +  I  ak  mod  pj  (4.3) 

k=l 
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and 


3 

“  *^2  I  P2  C^*^) 

k=l 

are  computed.  After  that,  S2  Is  reduced  modulo  and 

t  =  |si  -  Sjl  pj  (4.5) 

Is  computed.  This  quantity  Is  then  entered  Into  a  table  whose 
output  Is  the  quantized  product. 

For  one  quantization,  five  table  lookups  are  first  performed, 
in  parallel,  to  produce  eight  quantities.  These  are  separated  into 
two  groups  of  four  quantities,  and  all  four  quantities  in  each  group 
are  added  by  a  tree  adder.  In  this  manner  s^  and  S2  are  produced. 
Next  I-S2  Ipj  is  computed  from  S2  by  a  table  lookup.  Finally,  an 
addition  is  performed  to  compute  t  and  one  more  table  lookup  is  done 
to  give  the  final  result.  The  time  required  to  perform  one 
quantization  is  then  3t^  +  3Tg  seconds,  where  is  the  time 
required  for  one  table  lookup  and,  as  before,  Xg  is  the  time 
required  for  one  residue  addition. 
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Shortest  Path  Computations 


Once  the  distortion  values  are  available  for  use,  a  dynamic 
programming  algorithm  is  used  to  find  the  shortest,  or  least  distor¬ 
tion,  path  through  the  DTW  grid.  Since  the  distortion  values  are 
quantized  to  a  few  levels,  as  discussed  in  section  3,  the  path  com¬ 
putations  can  be  carried  out  in  RNS  with  path  differences  generally 
small  enough  to  be  contained  within  the  largest  modulus  of  the  RNS, 
occasional  overflows  not  causing  catastrophic  harm. 

To  discuss  a  systolic  architecture  for  these  computations  it 
will  be  convenient  to  temporarily  set  aside  the  pipelining  of  the 
distortion  computations  discussed  in  section  4.3.2.  For  the  present 
discussion,  we  will  assume  that  all  the  distortion  values  for  a 
particular  pair  of  utterances  are  available  in  the  array  and 
describe  the  pipelined  data  flow  of  the  path  computations  in  the 
same  array.  It  will  then  be  evident  that  both  the  distortion  and 
path  computations  can  be  synchronously  pipelined  in  the  same  DTW 
array. 

The  optimality  principle  of  dynamic  programming  states  that  if 
a  path  of  minimal  cost  between  two  points  a  and  c  passes  through 
point  b,  then  the  portion  of  the  path  that  goes  from  point  a  to 
point  b  is  optimal  too,  among  paths  from  a  to  b.  In  accordance  with 
this  principle,  in  the  DTW  algorithm,  cell  (i,j)  computes  the  cost 
of  the  optimal  path  starting  at  (1,1)  and  ending  in  itself.  This  is 
done  Iteratively,  all  cells  along  the  same  diagonal  computing  at  the 
same  time.  Cell  (1,1)  computes  djj^  and  calls  it  minimum 

cost  to  get  to  cell  (1,1).  Next  the  cells  on  the  second  diagonal, 
cells  (2,1)  and  (1,2),  compute  d2|  and  dj2  respectively  and  add  it 
to  Cl  1  to  produce  C21  and  Ci2*  Then  the  cells  on  the  third  diagonal 
do  the  same  thing.  This  time  cell  (2,2)  considers  two  possible 


cells  from  which  a  path  can  emanate,  namely  cells  (2,1)  and  (1,2) 
and  computes  C22  =  <122  +  mln(ci2 >^21 )•  Global  constraints  will 
restrict  the  number  of  cells  to  be  considered,  while  local 
constraints  will  restrict  the  number  of  paths  to  each  cell. 

In  general,  according  to  type  3  local  path  constraints  dis¬ 
cussed  In  section  3,  each  cell  considers  at  most  four  cells  from 
which  a  path  to  It  can  emanate,  l.e.,  cell  (l,j)  computes 

cij  =  dlj  +  'nln(ci_2,  j-i  +  <*1-1,  j.  Ci_2j-2 

+  dl-l,j.  Cl-l,j-l.  ci-l,j-2)-  (^-6) 

For  this  to  be  possible,  the  four  path  costs  must  be  available 
at  the  Input  of  cell  (l,j)  when  the  cells  on  Its  diagonal  are  ready 
to  compute.  This  Is  clearly  possible  since  the  four  path  costs  are 
on  diagonals  that  have  already  completed  computation.  In  the  physi¬ 
cal  array  all  data  move  horizontally  or  vertically,  and  diagonal 
communication — e.g.,  the  communication  of  from  cell 

(1-1, j-1)  to  cell  (l,j) — Is  done  through  cell  (l,j-l).  At  each  step 
on  such  a  path,  the  data  advance  one  diagonal.  All  data  being 
operated  on,  or  computed,  corresponding  to  one  pair  of  utterances, 
lie  on  the  same  diagonal. 

From  this  observation,  it  follows  that  the  distortion  values  on 
a  given  diagonal  need  not  be  available  until  the  computational  wave- 
front  has  reached  that  diagonal  and  this  can  be  managed  by  the 
pipelined  scheme  already  discussed  for  the  distortion  computations. 
Hence,  the  DTW  can  be  completely  pipelined,  with  each  diagonal 
handling  one  pair  of  utterances.  From  cell  (m,n)  the  scores  of  the 
pairs  of  utterances  compared  will  then  emerge  one-by-one. 


Figure  4.16  shows  a  typical  cell  In  the  DTW  grid.  The  computa¬ 
tion  of  c^j  Is  done  In  two  steps  so  that  actually  only  three  quan¬ 
tities  previously  computed  are  fed  to  each  cell  Is  the 

minimum  of  ci-2,j-l  +  ‘^1-1,  j  <^±-2,^-2  + 

Is  not  used  by  cell  (l,j)  but  Is  passed  along  from  cell 
(1-1, j)  to  cell  (l,j+l).  Similarly,  Is  passed  to  the 

cell  above,  after  It  has  been  used  by  cell  (l,j).  Finally,  c^j  Is 
computed  and  passed  to  cell  (1+1, j).  Thus,  In  addition  to  Inputs 
£(j)  and  ^(j),  which  get  passed  along  after  their  scalar  product  Is 
computed,  each  cell  accepts  four  Inputs  and  produces  four  outputs. 

Figure  4.17  shows  a  sequence  of  data  flow  for  the  path 
computation  between  two  utterances  In  a  3-by-3  DTW  array.  For 
purposes  of  Illustration,  the  local  distortion  values  are 

shown  to  exist  throughout  the  array  prior  to  being  used  In  the 
computation,  but  It  Is  clear  that  only  distortion  values  on  the 
diagonal  performing  computations  are  required,  and  hence  that  the 
process  can  be  pipelined. 

Each  cell  receives  four  path  weight  symbols  (two  from  the  cell 
below  and  two  from  the  cell  on  the  left)  and,  likewise,  outputs  four 
symbols.  The  generation  of  these  quantities  Is  shown  In  figure 
4.17.  As  the  computational  front  progresses,  each  computing 
diagonal  has  available  all  the  required  Inputs  and  computes  all 
outputs  required  by  the  next  diagonal. 

The  complete  process,  the  aggregate  of  local  distortion  compu¬ 
tation,  path  computation  and  pipelining.  Is  shown  for  two  adjacent 
pairs  of  utterances  In  figure  4.18. 


The  path  computations  are  carried  out  In  several  residue 
channels,  but  the  local  path  decisions,  which  require  the  computa¬ 
tion  of  two  mlnimums,  are  made  using  only  one  residue  channel.  The 
decisions  are  then  communicated  to  the  other  channels. 

The  required  operations  are 

ci,j  =  dij  +  mln(ci_i^ j-l,Ci_i^ j_2) 

cij  =  mln(ci^j,dij  + 

Each  minimization  Involves  one  residue  addition  and  two  seven-bit 
table  lookups.  Computation  of  then  takes  Zig  + 

seconds.  The  sum  indicated  inside  the  parentheses  in  the  second 
equation  can  be  performed  in  parallel  with  the  computation  of 
c^^j,  so  that  the  computation  of  contributes  Ztg  + 

seconds . 

A  block  diagram  of  the  cell  operations  that  follow  the  compu¬ 
tation  of  dij  is  given  In  figure  4.19.  The  two  dashed  boxes 
contain  the  decision  making  portion  of  the  minimizations,  and  only 
one  set  of  these  is  required  per  cell.  Each  residue  channel 
contains  a  set  of  selectors,  which  receive  the  two  quantities  being 
minimized,  as  well  as  the  decision  signal  which  carries  one  bit  of 
Information. 
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Figure  4.19.  DTW  MOD  P  CELL  ARCHITECTURE 


Interconnection  Concepts 


A  solution  to  the  seemingly  difficult  wiring  problem  required 
by  the  communication  between  the  (typically  50-by-50)  arrays  corre¬ 
sponding  to  the  different  residues  is  to  define  the  basic  cell  so 
that  it  contains  all  the  residues,  as  in  figure  4.20.  The  wiring 
Indicated  in  the  figure  could  be  carried  on  two  levels,  with  all 
horizontal  lines  on  one  level  and  all  vertical  lines  on  the  other 
one.  All  path  computations  could  be  performed  in  the  central,  and 
largest  of  all  the  cells. 

All  lines  indicated  are  one-bit  lines,  with  all  data  being 
transmitted  serially  at  high  speed.  Each  one  of  the  five  data  lines 
on  each  side  of  the  cell  would  be  used  to  shift  in  13  seven-bit 
symbols,  with  the  central  one  shifting  two  additional  seven-bit  path 
cost  symbols.  This  gives  a  total  of  20  data  lines,  and,  on  any 
line,  no  more  than  105  bits  would  have  to  be  transmitted  during  one 
iteration.  Assuming  a  20  MHz  one-bit  shifting  rate  between  one  chip 
to  another,  this  could  be  done  in  about  5  microseconds. 


timate  of  Throughput 


The  diagram  In  figure  4.21  shows  the  DTW  cell  operation  on  a 
time  scale;  the  times  Indicated  are  based  on  the  following  table  of 
estimated  basic  operation  times. 


7-blt 

residue  adder 

Ts 

=  84 

ns 

7-blt 

residue  multiplier 

=  300 

ns 

7-blt 

Input,  14-blt  output 

rt 

=  100 

ns 

table  look-up  (PLA 

1-bit 

shift 

Tb 

=  50 

ns 

The  duration  of  the  cell  operating  cycle  Is  determined  by  the  input 
data  shift,  which  In  turn  is  determined  by  the  Input-output  pin 
limitation.  If  one  cell  is  to  occupy  one  VLSI  IC,  then  two  pins  are 
required  for  VDD  and  ground,  two  pins  for  a  two-phase  clock,  at 
least  20  pins  for  data  and  some  more  for  testing.  A  40-pin  DIP 
could  be  used,  which  would  have  a  few  additional  pins  that  could  be 
used  to  speed  up  the  data  flow.  An  alternative  Is  to  put  nine  cells 
In  a  3-by-3  array  on  an  84-pln  PGA  (Pin  Grid  Array),  In  which  case 
the  minimum  number  of  required  data  lines  would  be  60.  From  figure 
4.21  we  see  that  the  other  operations  can  be  performed  in  137^  and 
17Tg  seconds  or  about  5us,  so  we  conclude  that  one  DTW  output  (one 
comparison  between  a  pair  of  utterances)  would  be  produced  every 
5iis.  For  a  (fast)  speaking  rate  of  ten  utterances  per  second,  a 
20,000  word  library  could  be  scanned  for  every  utterance.  The 
process  could  also  be  speeded  up  by  doing  more  than  one  multiplica¬ 
tion  and  addition  In  parallel  when  forming  the  scalar  product  of 
u(l)  and  r(j). 
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4.4  SUMMARY 


Figure  4.22  depicts  schematically  a  DTW  array  with  pipelined 
Inputs.  The  test  Inputs  are  provided  by  a  bank  of  three  correlators 
as  discussed  previously,  while  the  reference  vectors  are  taken  from 
a  stored  library.  The  commutator  sorts  the  auto_orrelatlon  vectors 
produced  by  the  correlators,  and  the  cells  shown  serve  to  provide 
the  appropriate  skew  to  the  Input  data  of  the  DTW  array.  It  is 
assumed  that  the  logarithms  of  the  process  gains  (computed  sepa¬ 
rately  and  converted  to  RNS  form  using  the  three  moduli  which 
provide  the  range  for  the  DTW  computation)  are  Included  with  the 
correlation  vectors  for  Implementation  of  the  Itakura-Saito  metric. 
Finally,  the  DTW  output  must  be  converted  to  a  weighted  number 
system,  possibly  In  mixed  radix  form  before  the  different  scores  are 
normalized  and  compared. 


SECTION  5 


CONCLUSIONS  AND  RECOMMENDATIONS 


5 . 1  SUMMARY 


In  the  previous  paragraphs  we  have  reported  on  our  work  over 
the  preceding  year  to  demonstrate  the  effectiveness  of  the  combina¬ 
tion  of  RNS  computations  and  systolic  array  architectures  for  the 
Improved  implementation  of  speech  recognition  algorithms,  improve¬ 
ment  being  assessed  with  regard  to  throughput  and  complexity  of 
hardware  implementation.  From  our  work  we  conclude  that  RNS  does 
indeed  have  utility  for  implementing  a  dynamic  time-warping  word 
recognition  algorithm  driven  by  autoregressive  spectral  analysis  and 
the  Itakura-Salto  distortion  computation.  We  showed  how  the  combi¬ 
nation  of  RNS  and  systolic  arrays  could  be  used  to  effectively 
implement : 

The  autocorrelation  estimates  used  in  autoregressive 
(or  LPC)  spectral  analysis; 

Computation  of  the  Itakura-Saito  distortion  values; 

Dynamic  time-warping  least-cost  path-metric  computa¬ 
tions  . 

The  incorporation  of  the  distortion  computations  with  the  path- 
metric  computations  in  a  pipelined  two-dimensional  systolic  array 
was  shown  to  have  a  favorable  impact  in  providing  a  high  throughput 
for  the  most  computationally  intensive  part  of  the  recognition 
algorithm. 


These  Investigations  were  supported  by  the  development  of  a 
computer  simulation  which  was  used  extensively  for  experimentation 
with  an  eleven-word  vocabulary,  three  to  six  different  productions 
of  each  word  being  stored  in  a  reference  library  to  accommodate 
variations  in  utterance  of  the  same  word.  From  our  experiments,  we 
developed  a  means  of  reducing  the  dynamic  range  of  the  path-metric 
computations  by  quantization  of  the  distortion  values  within  RNS.  A 
good  means  of  narrowing  the  range  of  distortion  values  seems 
critical  to  successful  RNS  implementation. 

A  number  of  different  distortion  functions  have  been  studied 
and  reported  in  the  speech  processing  literature;  many  of  them  are 
variants  of  the  Itakura-Salto  metric,  and  thus  based  on  LPC  analysis 
[19J.  Experimental  results  have  largely  shown,  in  our  opinion,  that 
different  metrics  of  this  sort  are  roughly  equivalent,  although  a 
ranking  based  on  small  Improvements  could  be  contemplated  [3].  Much 
of  the  interest  in  LPC  stems  from  its  success  in  modeling  the  speech 
production  process  in  a  noise-free  environment,  but  it  is  known  that 
the  all-zero  linear  prediction  model  breaks  down  in  the  presence  of 
additive  noise  [19]. 

From  our  point  of  view,  not  only  must  the  distortion  measure 
produce  satisfactory  discrimination  in  a  narrow  range  of  values,  it 
must  also  be  suitable  for  RNS  comp\itations.  For  speech  recognition, 
the  purpose  of  the  spectral  analysis  and  distortion  computation  is 
to  distill  the  Information  contained  in  the  speech  waveform  into  a 
small  set  of  data  suitable  for  low-error  discrimination  between 
distinct  word  patterns,  and  not  to  preserve  information  needed  for 
high-grade  speech  synthesis.  Thus,  the  LPC  methods  may  be  unneces¬ 
sarily  stringent  for  speech  recognition  purposes,  while  weak  in  the 
presence  of  noise,  and  at  the  same  time  imposing  the  need  for  high- 
precision,  and  therefore  large  dynamic-range,  computations. 


While  we  remain  convinced  by  the  results  of  our  work  that  RNS 
implementation  has  high  potential  for  speech  recognition,  we  are 
also  convinced  that  substantial  Improvement  in  reducing  the  gross 
dynamic  range  of  the  distortion  computations,  without  sacrificing 
discrimination  ability,  is  required  before  RNS  implementation  will 
be  accepted  as  a  truly  practical  or  attractive  alternative  to 
conventional  architectures  that  use  floating-point  computation. 

This  will  require  a  re-assessment  of  the  distortion  function  used  to 
support  the  DTW  computations.  Some  alternative  distortion  function 
computations  that  could  be  considered  in  a  follow-on  effort  are 
discussed  briefly  below. 

5.2  ALTERNATIVES  FOR  DISTORTION  COMPUTATION 


SSSiSfl 
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In  our  work,  we  chose  the  LPC  analysis  and  Itakura-Saito 
distortion  for  the  reasons  discussed  in  section  1,  but  perceive  the 
need  to  experiment  with  other  distortion  functions  in  any  follow-on 
effort.  It  is  particularly  perplexing  that  the  path-metric 
computations  can  be  successfully  carried  out  with  very  coarse 
quantization  of  the  distortion  values,  while  our  Implementation  of 
the  Itakura-Saito  metric  shows  a  need  to  maintain  a  much  larger 
range  for  its  computation. 


• 


5.2.1 


Itakura  Distortion 


One  variation  of  the  Itakura-Salto  distortion  function,  some¬ 
times  called  the  gain-optimized  Itakura-Salto  distortion,  was  origi¬ 
nally  recommended  by  Itakura  [21].  Its  distinguishing  feature  Is 
the  use  of  a  logarithmic  function  of  the  ratio  of  the  linear  predic¬ 
tion  residuals.  It  provides  a  measure  that  Is  roughly  proportional 
to  the  logarithm  of  the  Itakura-Salto  (or  maximum  likelihood) 
distortion.  As  such.  It  has  the  desirable  attribute  of  compressing 
the  range  of  distortion  values.  Hardware  Implementation  of  this 
method  has  been  reported  In  the  literature  In  which  the  logarithmic 
distortion  values  are  confined  to  the  range  of  a  few  bits  and  In 
which  the  path  metric  computations  are  carried  out  with  16-blt 
fixed-point  arithmetic  [10] •  The  key  to  that  Implementation  Is  the 
simplicity  of  a  logarithmic  quantizer  based  on  a  geometric  series 
representation  of  the  logarithm  in  a  form  that  Is  amenable  to  hard¬ 
ware  Implementation  (with  roughly  200  logic  gates).  The  referenced 
work  points  out  the  feasibility  of  compressing  the  range  of  the 
Itakura-Salto  distortion  values  (computed  with  24-bit  precision), 
but  it  Is  not  directly  of  value  to  our  RNS  implementation.  Since  we 
would  like  to  pipeline  the  distortion  computations  In  the  same  RNS 
systolic  array  that  Is  used  for  the  path-metric  computations,  we 
would  need  to  Invent  a  logarithmic  converter  in  RNS,  and  that  Is  not 
a  likely  prospect. 


5.2.2  Euclidean  Distance  Measures 

Squared  Euclidean  distance,  or  equivalently  mean-squared  error, 
while  traditional  In  signal  processing  work,  has  tended  to  be 
rejected  In  recent  speech  research  on  the  subjective  grounds  that  It 
Is  not  sufficiently  meaningful  for  representing  what  are  thought  to 
be  the  requirements  of  auditory  perception.  It  Is  pointed  out  that 
the  ear  needs  only  to  recognize  the  random  process  producing  the 
waveform  to  within  some  accuracy  and  does  not  need  to  accurately 
reproduce  the  specific  waveform,  and  that  demanding  a  small 
mean-squared  error  In  a  speech  system  will  often  require  far  more 
bits  and  accuracy  than  the  human  ear  requires  [21].  The  success  of 
the  LPC  methods  can  be  attributed  to  the  corresponding  distortion 
measures  (such  as  Itakura-Salto's)  that  measure  In  a  probabilistic 
sense  the  closeness  of  the  original  and  reproduced  processes  or 
models  rather  than  the  actual  waveforms. 

Mean-squared  error,  however,  cannot  be  rejected  on  the  grounds 
that  It  Is  too  forgiving.  If  we  base  our  analysis  on  power  spectra, 
or  equivalently  autocorrelation  analysis,  then  Euclidean  distance 
may  still  be  useful  to  discriminate  spectral  patterns  for  purposes 
of  automatic  speech  recognition.  For  RNS  Implementation,  we  prefer 
squared  Euclidean  distance  since  It  avoids  the  need  for  explicit 
sign  detection  which  would  suggest  leaving  RNS  for  that  purpose. 

The  utility  of  this  measure  will  then  depend  on  our  ability  to 
contain  the  range  of  values  to  a  practical  size.  Two  approaches 
that  use  squared  Euclidean  distance  will  be  discussed  briefly. 


5. 2. 2.1  Log  Spectral  Deviation 


One  of  the  oldest  distortion  measures  proposed  for  speech  Is 
formed  by  the  Lp  norm  of  the  difference  of  the  logarithms  of  the 
power  spectra.  Assuming  the  spectral  envelopes  have  been  sampled 
and  scaled  logarithmically,  the  L2  norm  Is  simply  the  squared  Eucli¬ 
dean  distance  between  the  vectors  of  logarithmic  spectral  samples. 
One  of  the  traditional  ways  of  providing  the  spectral  envelopes  Is 
by  means  of  a  bank  of  constant-Q  filters  appropriately  spaced  across 
the  speech  spectrum,  the  output  of  each  filter's  power  being  sampled 
In  time  and  scaled  logarithmically  [22].  Such  a  filter  bank  Is 
probably  best  Implemented  with  analog-sampled  swltched-capacltor 
active  filters  rather  than  In  the  form  of  digital  filters;  thus  we 
would  not  propose  to  use  RNS  for  the  filter  bank,  but  would  convert 
the  log-spectral  samples  to  RNS  code  for  computation  of  the  squared 
Euclidean  distortion  In  a  systolic  array  of  the  type  described  In 
section  4. 

Another  means  of  performing  the  filter-bank  analysis  would  be 
to  perform  a  discrete  Fourier  transform  (DFT)  of  the  windowed  speech 
samples  with  a  moderately  high  resolution  (perhaps  256  to  1024 
samples  per  frame),  subdivide  the  samples  Into  appropriate  bands 
over  which  the  complex  DFT  samples  are  to  be  squared  and  summed, 
compute  the  power  In  each  sub-band  and  convert  to  logarithmic  form. 
With  the  exception  of  logarithmic  conversion,  all  of  the  processing 
could  be  carried  out  with  RNS.  Logarithmic  conversion  would  require 
reconversion  to  a  weighted  number  system  (which  would  probably  be 
needed  for  spectral  normalization  anyway)  followed  by  reconversion 
to  RNS  for  the  distortion  computation.  These  processing  steps  would 
clearly  be  more  complicated  than  the  computation  of  the  autocorrela¬ 
tion  samples  described  in  section  4,  but  the  operations  would  need 


to  be  performed  only  once  for  each  test  segment  and  the  method 
should  be  worthy  of  consideration  for  future  effort. 

5. 2. 2. 2  Direct  Autocorrelation  Analysis 


Although  the  autocorrelation  coefficients  of  the  windowed 
speech  samples  could  be  transformed  by  DFT  to  provide  a  representa¬ 
tion  of  the  spectral  envelope,  It  may  be  better  to  use  them  directly 
for  spectral  discrimination.  As  shown  In  section  2,  all  of  the 
processing  for  estimating  LPC  coefficients  and  computing  the 
Itakura-Salto  distortion  Is  carried  out  In  the  time  domain,  without 
any  specific  need  for  the  frequency  spectra.  While  LPC  analysis  Is 
used  to  approximate  the  spectral  envelope,  as  represented  by  the 
all-pole  linear  filter  model.  It  Is  essentially  a  linear  transforma¬ 
tion  of  a  subset  of  autocorrelation  coefficients.  The  assumption  of 
an  all-pole  model  of  the  speech  production  process  allows  this  sub¬ 
set  of  autocorrelation  values  to  accurately  approximate  the  remain¬ 
ing  ones  In  accordance  with  the  autoregressive  nature  of  the  model. 
This  Is  the  basis  for  obtaining  a  good  spectral  approximation. 


If  the  LPC  model  Is  adequate,  then  for  the  purpose  of  automatic 
speech  recognition  It  may  be  satisfactory  to  employ  the  subset  of 
autocorrelation  coefficients  used  In  LPC  analysis  directly  In  a 
squared  Euclidean  distance  computation.  For  RNS  Implementation, 
since  It  would  be  appropriate  to  work  with  normalized  correlation 
coefficients.  It  would  be  necessary  to  leave  RNS  to  carry  out  the 
normalization  and  then  reconvert  to  RNS  for  the  distance  computa¬ 
tions.  But,  such  a  technique  for  distortion  computation  might 
actually  be  simpler  to  Implement  than  the  Itakura-Salto  distortion 
and  there  would  be  no  need  to  solve  the  normal  equations  for 


construction  of  the  reference  library.  Also,  If  the  number  of 
samples  used  Is  extended  beyond  the  LPC  subset,  then  the  dependence 
on  the  assumption  of  an  autoregressive  model  Is  diminished. 

5.3  FOLLOW-ON  RECOMMENDATIONS 

In  any  follow-on  effort  we  would  propose  to  experimentally 
study  the  feasibility  of  using  a  squared-Euclidean  metric,  rather 
than  the  Itakura-Salto  distortion,  for  RNS  implementation  of  the 
distortion  computations.  The  spectral  envelope  and  direct  auto¬ 
correlation  methods  discussed  above  would  be  a  starting  point  for 
such  Investigations.  A  critical  Issue  Is  the  dynamic  range  Imposed 
by  the  square-law  measure,  particularly  when  comparing  dissimilar 
words.  We  must  recognize,  however,  that  occasional  overflow  of  the 
computation  range,  in  which  RNS  converts  the  statistical  outliers  to 
values  lying  In  a  valid  range.  Is  probably  tolerated  in  the  path 
metric  computations,  particularly  since  these  events  are  more  likely 
to  occur  while  comparing  dissimilar  words  whose  best  DTW  path 
metrics  should  be  relatively  large  in  any  case.  It  would  be  quite 
surprising  If  the  overflows  associated  with  bad  word  matches  were  to 
conspire  to  produce  a  path  metric  smaller  than  that  of  a  proper 
match. 

If  either  of  the  methods  discussed  were  to  prove  viable  in 
reducing  the  range  of  distortion  values  significantly,  then  it  would 
be  appropriate  to  contemplate  actual  VLSI  hardware  implementation  of 
the  simplified  DTW  systolic  array,  requiring  an  Increased  attention 
to  the  details  of  combinational  logic  design  of  the  cells  of  the 
array  In  a  follow-on  effort. 
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APPEND  LX  A 


RESIDUE  NUMBER  SYSTEMS 

An  RNS  with  range  M  =  P1P2  •••?£,»  where  each  p]^  Is  a  prime 
Integer,  called  a  modulus,  Is  a  number  system  represented  by  the 
Integers  In  the  Interval  [0,M-1]  In  which  addition  and 
multiplication  of  two  elements  x  and  y  Is  defined  as  the  remainder 
on  division  by  M  of  their  sura  and  product,  respectively.  Under 
these  operations,  the  usual  sum  or  product  of  x  and  y  Is  obtained, 
provided  that  the  result  lies  In  the  same  range  [0,  M  -  1].  The 
main  advantage  afforded  by  RNS  Is  that  the  complexity  of  the 
operations  (measured  by  the  size  of  a  table  required  to  Implement 
the  operations,  for  Instance)  is  lower  than  that  of  the 
corresponding  Integer  operations  because  of  the  ability  to  decompose 
the  processor  into  a  set  of  Independent  processors  each  performing 
operations  modulo  p^^  [23]. 

In  RNS  an  Integer  x  In  the  interval  [0,M-1]  is  represented  by 
Its  residues  modulo  the  p}^,  l.e.. 


x  =  (xi  ,  X2  ,  ,  Xj^)  (A-1) 

where  xj^  equals  the  remainder  obtained  when  x  Is  divided  by 
Pl^;lt  follows  that,  for  each  k,  0  <  <  pj^.  Every  number 

between  0  and  M  -  1,  Inclusively,  has  a  unique  RNS  representation, 
and  every  l-vector  (xj ,  x^ ,  ...,  x^)  with  0  Xj^  <  P]^^  corre¬ 

sponds  to  a  unique  Integer  x  with  0  ^  x  <  M.  Furthermore,  the  sum 
(product)  of  two  Integers  x  and  y  in  the  RNS  can  be  obtained  by 
adding  (multiplying)  corresponding  components  x^  and  yj^,  modulo 
the  corresponding  pi^. 


In  this  manner,  any  operation  requiring  only  addition  and 
multiplication,  the  Input  and  output  ranges  of  which  are  known,  can 
be  Implemented  In  RNS.  Since  Intermediate  overflow  is  harmless, 
provided  that  the  final  result  Is  contained  in  I0,M-1],  It  follows 
that  the  range  of  the  system  Is  solely  determined  by  the  range  of 
the  output. 

Figure  A-1  Illustrates  the  architecture  of  a  general  RNS 
processor.  After  the  Input  has  been  converted  to  RNS  form,  the 
function  Is  Implemented  by  an  Independent  set  of  processors,  each 
operating  In  a  different  residue  channel.  Usually,  the  RNS  output 
Is  converted  to  a  weighted  number  system,  but  linear  operations 
could  be  continued  In  RNS  representation. 


A  formula  relating  a  positive  Integer  x  e  [0,  M-1]  to  its  RNS 


representation  (xj ,  X2 ,  ...>  X£),  Is  given  by 


X 


mod  M 

1  =  1 


(A-2) 


where  =  ’1/pi  and  Is  the  Inverse  of  modulo  p£, 
l.e.,  an  Integer  M^^ln  the  range  [0,  p^-l]  such  that 


1  mod  p£ 


(A-3) 
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For  example,  If  pj  =  3,  p2  =  5,  and  P3  =  7,  then  =  35, 

M2  =  21,  M3  =  15,  Mj  ^  =  2,  M2^  =  1,  and  M3^  =  1.  Now,  If  x  =  34, 


then  (xj ,  X2 ,  X3)  =  (1,  4,  6)  and  the  formula  gives 


x=l*  35*  2+4*  21  •  l+6»  15  •  1  mod  105 
=  244  mod  105 

=  34  mod  105. 


(A-4) 


This  method  of  reconstruction,  illustrated  by  (A-4),  which  is 
an  example  of  the  Chinese  Remainder  Theorem,  is  useful  for 
theoretical  purposes.  A  more  efficient  method  for  obtaining  a 
weighted  number  representation  of  an  RNS-coded  number  is  to  convert 
to  a  mixed-radix  representation  using  the  moduli  of  the  RNS  as 
radices . 


The  mixed-radix  representation  of  an  integer  x  in  [0,  M-1]  is 
given  by 


I 
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i-l 
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Ci  1  Pj 


(A-5) 


wh**re  fh*‘  ■(,  rallfri  ftie  mixed-radix  coefficients,  satisfy 

'  '  j  [>(  .  In  I-. HI. If  t  nn  (  X-^)  , 
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For  a  three-moduli  RNS,  equation  (A-5)  can  be  written  as 


X  =  C3P2P1  +  C2P1  +  Cj.  (A-7) 

Note  that  from  equation  (A-3)  we  have 


cj  =  X  mod  pj 

C2  =  (x-Cj)/pi  mod  P2 

C3  =  (x  -  (x-Cj )/p^)/p2  mod  P3 


(A-8) 


Equation  (A-8)  can  be  generalized  to  obtain  the  mixed-radix  coeffi¬ 
cients  of  a  given  number  in  an  RNS  with  an  arbitrary  number  of 
moduli.  A  refinement  of  the  resulting  equations  yields 


C3  =  Xj  mod  pj 

C2  -  Pi^(x2  “  C3)  mod  p2 

C3  -  P2^(p/(*3  -  Cl)  "  C2)  mod  P3 


(A-9) 
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The  coefficients  cj  are  derived  iteratively  by  starting  with 
Cl  =  Xi  and  computing  c^  by  subtracting  cj-i  from  the  result  of 
the  previous  iteration  and  multiplying  the  difference  by  p^ii,  all 
modulo  pj . 


Finally,  the  evaluation  of  equation  (A-5)  can  be  done  recur¬ 
sively  using  Horner's  scheme: 

X  =  (...  +  c^_^)pj_2  +  C|j_2)Pj^_3  ...  +  C2)Pi  +  Cj  (A-10) 

Efficient  VLSI  hardware  structures  that  perform  pipelined 
blnary-to-resldue  and  resldue-to-blnary  conversion  for  a 
five-modulus  RNS  have  been  designed  by  MITRE's  Integrated 
Electronics  project  for  the  primes  (31,29,23,19,17)  representable 
with  five  bits. 
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